A CRASH COURSE GUIDE TO MGMERGE
===============================

Shane Hudson
15 November 1994

This document is intended as a note to the maintainers of mgmerge,
outlining the mgmerge utility, the changes that had to be made to the
existing mg code to make mgmerge work, and the new source code for
mgmerge.

**** NOTE: ****

All the NEW/CHANGED files are in the file "new_mg.tar", which you should
have received with this document.  Files that were not changed are NOT in
the tar file, so it will have to be extracted "on top of" the existing
code.  You'll figure it out :)

**** DESCRIPTION: ****

mgmerge adds documents to an existing mg database without the need to
rebuild the database from scratch.  It works by building a temporary new
database from the new documents, then merging the "old" and "new"
databases to give one merged database.

The source consists of:

    mgmerge.sh        The main script; a lot like mgbuild.sh
    mg_get_merge.sh   Script to retrieve the new documents; a lot like
                      mg_get.sh
    mg_merge.h        Header file used by mg_invf_merge.c and
                      mg_text_merge.c
    mg_text_merge.c   Program to append one compressed text file to
                      another
    mg_invf_merge.c   Program to merge two inverted files and stemmed
                      dictionaries

See the man pages for mgmerge, mg_get_merge, mg_text_merge and
mg_invf_merge for a brief overview.

mg_get_merge.sh should really be folded into mg_get.sh, with a "-merge"
option used when it is being called by the mgmerge utility, but I was too
lazy to change it to that.

The files updated by mgmerge are:

    *.text                with mg_text_merge
    *.text.idx            with mg_text_merge
    *.text.idx.wgt        with mg_weights_build (after merging)
    *.invf                with mg_invf_merge
    *.invf.idx            with mg_invf_merge
    *.invf.dict           with mg_invf_merge
    *.weight              with mg_invf_merge or mg_weights_build
    *.invf.dict.blocked   with mg_invf_dict (after merging)

The *.text.stats and *.text.dict files remain the same.  Any other
database files (e.g., *.invf.dict.hash) will be out of date after a merge
and will need to be recomputed.

--------------------------------------------------------------------------

MG_TEXT_MERGE
=============

The two "merging" utilities are mg_text_merge and mg_invf_merge.

mg_text_merge simply appends the compressed text for the new documents to
the old compressed text file.  For this to succeed, the two files must
have been created with the same parameters to mg_passes, and the same
compression model must be used for both.  So *.text.dict from the old
(static) database is used as the model for compressing the new documents.
The default "-C" option to mg_compression_dict in mgbuild was changed to
"-S" so that novel words in the new documents can be coded; the option
must be the same in mgmerge, and it is "-S" there too.
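The append itself is conceptually tiny.  Here is a minimal sketch of the
idea in C -- NOT the actual mg_text_merge source: the single-field header
and the names append_text/text_header are simplifying assumptions (the
real mg text header carries several more fields), but the shape is the
same:

    #include <stdio.h>

    /* Simplified stand-in for mg's compressed text header; the real
       header has several more fields. */
    typedef struct {
        unsigned long num_of_docs;    /* documents in this file */
    } text_header;

    /* Append the compressed bytes of `new' onto `old' (opened "r+b")
       and bump the document count in old's header. */
    static void
    append_text(FILE *old, FILE *new)
    {
        text_header oh, nh;
        int c;

        fread(&oh, sizeof oh, 1, old);
        fread(&nh, sizeof nh, 1, new);

        /* copy the new compressed bytes onto the end of the old file */
        fseek(old, 0L, SEEK_END);
        while ((c = getc(new)) != EOF)
            putc(c, old);

        /* only the document count changes in the old header */
        oh.num_of_docs += nh.num_of_docs;
        fseek(old, 0L, SEEK_SET);
        fwrite(&oh, sizeof oh, 1, old);
    }

Because both files were compressed against the same *.text.dict model,
the concatenated bytes decode correctly; the *.text.idx entries for the
new documents just need their offsets shifted by the length of the old
compressed text.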
---------------
EFFECT on FILE COMPRESSION PERFORMANCE

Since the compression model is not being updated, compression performance
for the text will slowly degrade.  The only solution is a periodic
rebuild from scratch.

---------------------------------------------------------------------------

MG_INVF_MERGE
=============

mg_invf_merge updates the stemmed dictionary and inverted file.

For reasons of simplicity, the stemmed dictionaries merged are in the
".invf.dict" format, NOT the ".invf.dict.blocked" format.  So if a
database is to be merged, the .invf.dict file should not be deleted after
the .invf.dict.blocked file used by mgquery has been created.  Also for
simplicity, level 3 inverted files are not supported, nor are
skipped-format files.

Merging the inverted files requires every entry in each file to be
decoded, and merged with the corresponding entry if one for the same term
appears in the other inverted file.  This can be a slow process if the
old inverted file is very large, especially since each entry is decoded
bit by bit with the bit-level I/O routines.

As was noted on page 94 of the book "Managing Gigabytes" (hereafter
called "MG"), switching from Bernoulli coding (called "Bblock" in the mg
source code) to gamma or delta coding for the inter-document gaps would
be an advantage, since those codes require no parameters.  With Bblock
coding, N (the number of documents) is a parameter, and because N changes
whenever documents are added, every entry must be decoded and recoded.

My approach was to keep Bblock coding, but keep the N parameter constant
by recording the artificial N used to code document gaps (called
"Nstatic") as well as the real number of documents.  Then any entry in
the old inverted file (called "IFold") that is not merged with an entry
in the new inverted file (called "IFnew") can be copied directly to the
merged inverted file (called "IFmerge").  This can give a significant
increase in speed, since IFnew is usually very small when only a few
documents are being added.

To store Nstatic, a field "static_num_of_docs" was added to the
invf_dict_header and stem_dict_header structs defined in invf.h.  Any
program that decodes inverted file entries also had to be changed.  (The
existing mg source code that I changed has been returned with this
document.)  Most changes were simply altering a line that called
BIO_Bblock_Init() so that it used the static_num_of_docs value rather
than num_of_docs.  Sketches of the header change and of the merge loop
itself follow below.

Note that since the file format has changed, any existing collection will
have to be rebuilt, even if using mgmerge on it is not intended.
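For concreteness, here is roughly what the two changes look like.  This
is a hedged sketch, not a diff of the real code: the surrounding fields
of invf_dict_header are omitted, and the call-site variable names ("idh",
"p", "B") are made up:

    /* invf.h (sketch): the one new field.  The real structs contain
       several other fields, omitted here. */
    struct invf_dict_header {
        unsigned long num_of_docs;         /* real number of documents */
        unsigned long static_num_of_docs;  /* NEW: frozen N ("Nstatic")
                                              used to code doc gaps    */
    };

    /* Typical call-site change in a decoder: B is the Bblock parameter
       for a term's inter-document gaps, p the number of documents
       containing the term. */
    B = BIO_Bblock_Init(idh.static_num_of_docs, p);  /* was num_of_docs */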
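And the merge loop itself: a sorted merge over the two stemmed
dictionaries.  This is only a sketch of the control flow in
mg_invf_merge.c; every type and helper below (next_term, copy_raw_entry,
and so on) is a hypothetical stand-in, not a real mg routine:

    typedef struct dict DICT;   /* an open dictionary + inverted file  */
    typedef struct term TERM;   /* one stemmed term and its disk entry */

    TERM *next_term(DICT *d);              /* NULL at end of file      */
    int   term_cmp(TERM *a, TERM *b);      /* NULL sorts last          */
    void  copy_raw_entry(DICT *in, DICT *out, TERM *t);
    void  recode_entry(DICT *in, DICT *out, TERM *t, long doc_offset);
    void  merge_entries(DICT *old, DICT *new, DICT *out,
                        TERM *t_old, TERM *t_new, long doc_offset);
    long  num_docs(DICT *d);               /* real document count      */

    void
    fast_merge(DICT *old, DICT *new, DICT *out)
    {
        TERM *t_old = next_term(old);
        TERM *t_new = next_term(new);

        while (t_old != NULL || t_new != NULL) {
            int cmp = term_cmp(t_old, t_new);

            if (cmp < 0) {
                /* Term only in IFold.  Its gaps were coded against the
                   frozen Nstatic, which has not changed, so the bits
                   are copied to IFmerge verbatim -- no decoding. */
                copy_raw_entry(old, out, t_old);
                t_old = next_term(old);
            } else if (cmp > 0) {
                /* Term only in IFnew.  Recode it against the merged
                   file's Nstatic, offsetting each document number by
                   the number of old documents. */
                recode_entry(new, out, t_new, num_docs(old));
                t_new = next_term(new);
            } else {
                /* Term in both.  Decode both postings lists, append
                   the (offset) new postings to the old ones, and
                   recode the lot against the same Nstatic. */
                merge_entries(old, new, out, t_old, t_new,
                              num_docs(old));
                t_old = next_term(old);
                t_new = next_term(new);
            }
        }
    }

A slow merge (described next) is the same loop with the first branch
removed: every entry, merged or not, is decoded and recoded using the
real value of N, which brings Nstatic back into line with N.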
-------------
EFFECT on INVERTED FILE COMPRESSION PERFORMANCE

As merges accumulate on a database, the compression performance will
decline.  The solution is an option in mg_invf_merge to do a "slow"
merge, where every inverted file entry is decoded and recoded using the
real value of N.  This is not a lot slower than a fast merge when
compared to the cost of completely rebuilding the collection from
scratch, and the option can be used periodically (when, say, 40% of the
inverted file has been added since the previous slow merge).

-----------------
THE EXACT WEIGHTS FILE

Recomputing exact document weights requires a complete scan over the
merged inverted file.  This is a waste, since the weights for old
documents will (hopefully) not change much anyway, and the weights for
new documents can be computed exactly while merging the inverted files,
at no extra time cost.  So mg_invf_merge computes the weights for the new
documents and updates the .weight file; the weights for old documents are
left untouched.

mgmerge has an option to recompute the weights file.  This is not much
more expensive in terms of time, and is only needed periodically.  The
effect of leaving the old weights unchanged is difficult to assess, since
the "correct" ranking of documents for a query is subjective.  How often
the weights file should be rebuilt with the "-w" option is hard to guess,
but if most merges involve only a few documents it makes sense not to
bother: the change in the old weight values is very small, and the
weights only stray from their "true" values slowly over accumulated
merges.

-------------------------------------------------------------------------

EXPERIMENTAL RESULTS
====================

With some small test collections (only a few Mb of text), mgmerge with a
slow merge and a weights file rebuild typically took under 20% of the
time of a complete rebuild.  Using the default options (fast merge, don't
rebuild weights) took around 10%, sometimes more.

For larger collections the savings were greater, especially if the added
text is very small.  For example, when the last 6Kb of text of the GNUBib
collection was added to the rest of it (14Mb) using mgmerge, the total
mgmerge time was around 20 seconds, whereas a complete mgbuild took 280
seconds.

Only one larger collection was tested (resource constraints prevented
testing on any really large collections): two short (one-line) documents
were added to the Gutenberg collection, which comprised nearly 74Mb of
source text and had an inverted file size of 9Mb.  mgmerge took 42
seconds (or 142 seconds with a slow merge and a weights file rebuild),
and I'd estimate that an mgbuild on the same machine would have taken at
least 20 minutes.

To summarise: mgmerge is not as fast as a more sophisticated method such
as the fixed-length blocks described in section 5.7 of "MG", but its huge
advantage is that it required almost no changes to the existing mg code,
and it is still far better than running mgbuild every time a document
needs to be added to a collection.

----------------------------------------------------------------------------

OTHER NOTES
===========

Since it works like mgbuild, mgmerge needs mg_get_merge to return the
SAME text each time "mg_get_merge -text .." is called.

----

One other thing discovered (but not fixed) was a bug in invf.pass2.c: it
crashes on parsed words longer than 128 characters.  So if any text with
words longer than 128 characters is piped to mg_passes, the "-N1" and
"-N2" options (the memory-efficient inversion method) had better be used,
and not the "-I1"/"-I2" options (which would only work on small
collections anyway, being the memory-inefficient method).
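If someone does want to fix it, the fix is presumably a bounds check
where the parsed word is copied.  A sketch of the sort of guard needed,
assuming a fixed 128-byte word buffer (the buffer size, the name
copy_word, and the truncate-rather-than-crash policy are all my
assumptions, not the actual invf.pass2.c code):

    #include <string.h>

    #define WORD_BUF_LEN 128    /* assumed size of the word buffer */

    /* Truncate over-long parsed words instead of overrunning the
       buffer.  A real fix would go wherever invf.pass2.c copies a
       parsed word into its fixed-size buffer. */
    static void
    copy_word(char *buf, const char *word, size_t len)
    {
        if (len > WORD_BUF_LEN - 1)
            len = WORD_BUF_LEN - 1;     /* quietly drop the excess */
        memcpy(buf, word, len);
        buf[len] = '\0';
    }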