TITLE
  Parsing of Long Words
APPLICATION
 mg-1, mg-2
TYPE
  bug
REPORT
  tim@cosc.canterbury.ac.nz - May 11th 1994
FIX
  tes@kbs.citri.edu.au - August 9th 1994
CLAIM
  Mg didn't handle long words properly; it crashed.
PROBLEM
  Invf passes calls PARSE_LONG_WORD [words.h] which uses a limit of
  MAXLONGWORD on iterating thru the string and storing into 
  a word. MAXLONGWORD = 8192.
  However, mg strings generally store the length in the first
  byte limiting them to 255 characters. The word which was passed
  to PARSE_LONG_WORD was an allocated string of MAXSTEMLEN = 255,
  which is as large as we should get anyway. Thus when accessing
  a larger word than 255 chars, PARSE_LONG_WORD would allow it
  (less than 8192) and would try storing beyond the array limit.
SOLUTION
  The author can't remember why PARSE_LONG_WORD was used and what
  the significance of MAXLONGWORD = 8192 is.
  So PARSE_LONG_WORD has been changed to PARSE_STEM_WORD which
  uses MAXSTEMLEN as its limit.
FILES
 *   words.h
 *   invf.pass1.c
 *   invf.pass2.c
 *   ivf.pass1.c
 *   ivf.pass2.c
 *   query.ranked.c
*************************************************************
TITLE
  Use of Lovins stemmer
APPLICATION
  mg-1
TYPE
  improve
REPORT
  local - 1994
FIX
  linh@kbs.citri.edu.au - 1994
CLAIM
  Stemming was done naively.
PROBLEM
  Only a few types of words and their endings
  were considered.
SOLUTION
  Replacement with a more elaborate "known" stemmer by Lovins.
  The algorithm is described in:
    J.B. Lovins, "Development of a Stemming Algorithm",
    Mechanical Translation and Computational Linguistics, Vol 11,1968.
FILES
 *   stem.c
 *   stem.h
*************************************************************
TITLE
  Different term parsing
APPLICATION
  mg-1
TYPE
  bug
REPORT
  tes@kbs.citri.edu.au - 23 Aug 1994
FIX
  tes@kbs.citri.edu.au - 23 Aug 1994
CLAIM
  Boolean queries did not extract words/terms using the
  same method as is done at inverted-file creation and
  as is used for rank query parsing.
PROBLEM
  The hand-written lex. analyser, query_lex, which is called by
  the boolean query parser was not calling a common
  word-extraction routine as used by the rest of mg. 
  This would be ok if the code did the same things - but they didn't.
  Query_lex, for instance, did NOT place any limit on the
  number of digits in a term.
  Of even more concern, it would allow arbitrary sized words
  although it used Pascal style strings which store the length
  in the first byte and can therefore only be 255 characters in length.
SOLUTION
  Query_lex in "query.bool.y", was modified to call the routine
  PARSE_STEM_WORD which is also used by text-inversion routines and
  ranking query routines.
  Now all terms are extracted by the same routine.
  To do this, the end of the line buffer had to be noted as
  PARSE_STEM_WORD requires a pointer to the end - which is the
  safe thing to do (don't want to run over the end).
  This meant I had to find the length of the query line buffer.
  This was allocated in the file "read_line.c" by the routine,
  "readline". Its size was the literal number 1024. 
  This was changed to a constant and placed in "read_line.h".
  The definition for PARSE_STEM_WORD can be found in "words.h".
FILES
 *   query.bool.y
 *   query.bool.c (by bison)
 *   read_line.c
 *   read_line.h
*************************************************************
TITLE
  Highlighting of query terms
APPLICATION
  mg-1
TYPE
  extend
REPORT
  tes@kbs.citri.edu.au - Aug 94
FIX
  tes@kbs.citri.edu.au - Sep 94
CLAIM
  Difficult to feel happy that the query-result returned is
  satisfying the query - need to look hard to find the queried words.
  Need to show words in results using some highlighting method.
PROBLEM
  No highlighting of query terms in results.
SOLUTION
  Mgquery was previously outputting the decompressed text to a pager
  such as "less(1)" or "more(1)".
  (Except when redirected or piped elsewhere :)
  So what was needed was some sort of highlight pager that instead of
  displaying the text would also use some means for highlighting the
  stemmed query words. 
  Two common forms of highlighting were chosen: underline and bolding.
  These are supported by "less(1)" and possibly by "more(1)" by
  using the backspace character.
  A highlight pager will also need to know which words need to be
  highlighted. Therefore, the code was modified to build up a
  string of the stemmed query words for passing to the highlight pager.
  Design Options:
  ---------------
    * Could do text filtering in mgquery before passing out to pager.
      Instead I pipe to a separate process, the "hilite_words" pager,
      which filters and pipes into less/more.
    * Could do different highlighting or a combination.
    * Could use a different structure for storing the query words other
      than the hash-table I used.
FILES
 *   Makefile - to include hilite_words target
 *   mg_hilite_words.c
 *   mgquery.c
 *   mgquery.1
 *   query.bool.y
 *   query.ranked.c
 *   environment.c
 *   environment.h
 *   backend.h
*************************************************************
TITLE
  Mg_compression_dict did premature free
APPLICATION
  mg-1
TYPE
  bug 
REPORT
  R.Sosic@cit.gu.edu.au - 23 Sep 94
FIX
  R.Sosic@cit.gu.edu.au - 23 Sep 94
CLAIM
  mg_compression_dict dumped core in 
	file: mg_compression_dict.c
	function: Write_data
	line: int codelen = hd->clens[i];
PROBLEM
  Huffman data, hd, was freed *before* it was accessed again. 
SOLUTION
  The freeing of hd has been moved to after all accesses
  (just before returning).
FILES
 *   mg_compression_dict.c
*************************************************************
TITLE
  Boolean tree optimising rewrite
APPLICATION
  mg-1
TYPE
  bug
REPORT
  sosic@kurango.cit.gu.edu.au - 23 Sep 94
FIX
  tes@kbs.citri.edu.au - Oct 94
CLAIM
  "I am still getting core dump in "and" queries in mgquery, 
   where the first word does not exist, but the second one does."
PROBLEM
  Having freed a particular node, it tried to refree it and 
  access one of its fields.

  I.e. code-fragment...

       FreeNode(curr); /* where curr = CHILD(base) for 1st term in list */
       FreeNodes(next);
       FreeNodes(CHILD(base)); 
              /* but CHILD(base) has already been freed above */
              /* if the node was the first one in the list */

SOLUTION
  A number of things in the code seemed a bit dubious to me.
  So I have rewritten the boolean optimising stage and abstracted out
  the various stages - each file starts with "bool".
  Boolean query optimising seems to be a tricky problem. 
  It is not clear that putting an expression into a certain form will
  actually simplify it and whether simplification means faster querying.
  I have converted a given boolean expression into DNF 
  (Disjunctive Normal Form).  "And not" nodes, which are readily apparent 
  in DNF, are converted to "diff" nodes.  I have only applied the idempotency 
  laws involving TRUE and FALSE, and not the ones requiring matching of 
  expressions - it is a potentially more complicated problem.
  The optimiser has been tested by playing with "bool_tester", and if you are
  having a crash or problem in a boolean query it would be worth testing the
  query on the "bool_tester." The token "*" stands for TRUE (or all documents) 
  and the token "_" stands for FALSE (or no documents). This should show the 
  expression before and after optimisation in an ascii tree bracketting format.
FILES
 *   bool_tree.c
 *   bool_parser.y
 *   bool_optimiser.c
 *   bool_query.c 
 *   bool_tester.c
 *   term_lists.c
*************************************************************
TITLE
  Mgtic pixel placement
APPLICATION
  mg-1
TYPE
  bug
REPORT
  Bruce McKenzie - bruce@cosc.canterbury.ac.nz (21st Oct 1994)
FIX
  singlis@cs.waikato.ac.nz
CLAIM
  mgtic crashed on certain files.
PROBLEM
  Placing pixels outside of bitmap.
SOLUTION
  Changed the putpixel routine to truncate at borders of the image.
FILES
 *   mgtic.c
*************************************************************
TITLE
  Improved boolean tree optimising
APPLICATION
  mg-1
TYPE
  improve
REPORT
  tes@kbs.citri.edu.au - 12/Dec/94
FIX
  tes@kbs.citri.edu.au - 21/Dec/94, 14/Mar/95
CLAIM
  Optimising by conversion to DNF is not necessarily such
  a good idea - can actually slow things down.
PROBLEM
  The distributive law used in converting to DNF
  duplicates expressions.
SOLUTION
  Introduce a query environment variable, optimise_type = 0 | 1 | 2.
  Type 0 does nothing to the parse tree. 
  Type 2 does the DNF conversion.
  Type 1 is the new default and does the following...
    Do simple tree rearrangement like flattening. 
    Optimise for CNF queries.
FILES
 *   bool_query.c, .h
 *   bool_optimiser.c
 *   environment.c
 *   invf_get.c
 *   bool_tree.c, .h
 *   bool_tester.c
 *   lists.h
*************************************************************
TITLE
  Mgstat with non-existent files
APPLICATION
  mg-1
TYPE
  bug
REPORT
  beebe@math.utah.edu - 16 May 1994
FIX
  tes@kbs.citri.edu.au - 10 Aug 1994
CLAIM
  NaNs and Infinites would be printed out by mgstat
  if unable to open .text or .text.dict file.
PROBLEM
  The NaNs etc. were output in the column stating
  the percentage size of the file compared with the
  number of input bytes of the source text data.
  If it couldn't read the .text file with its
  header describing the number of source text bytes, then
  in working out the percentage it would divide by zero.
  Also due to some bad control flow, it wouldn't attempt to
  open the .text file if it failed when opening 
  the .text.dict file.
SOLUTION
  Only printout the percentage if we can read the header
  from the .text file.
  Read in text header irrespective of text dictionary file.
FILES
 *   mgstat.c
*************************************************************
TITLE
  nonexistent HOME bug  
APPLICATION
  mg-1, mg-2
TYPE
  bug 
REPORT
  cgn@totara.cs.waikato.ac.nz - 2/May/95
FIX
  tes@kbs.citri.edu.au - 2/May/95
CLAIM
"The big problem was that mgquery crashes when the HOME environment
 variable is not set, which is the case when it is run by the www server."
 [...] "I expect it happens when looking for $HOME/.mgrc."
PROBLEM
  The result of getenv("HOME")" was used directly in
  a sprintf call. If the environment variable HOME
  was not in existence then null would be used.
  In some C libraries sprintf will convert the 0
  string into the string "(null)" on others it will core dump.
  (For example, Solaris seems to core dump, sunos 4 seems ok).
SOLUTION
  The result from getenv("HOME")" is tested before
  being used.
FILES
 *   commands.c
*************************************************************
TITLE
  mgquery collection name preference  
APPLICATION
  mg-1, mg-2
TYPE
  improve
REPORT
  cgn@totara.cs.waikato.ac.nz - 2/May/95
FIX
  tes@kbs.citri.edu.au - 4/May/95
CLAIM
 Surely something must override mquery's preference for ./bib. 
 If MGDATA is set correctly, I think it should prefer that collection, 
 and -d should definitely override it. 
 I could always say -d . if I really wanted ./bib.
PROBLEM
Currently the priority is:
1.  Check if ./name is a directory,
     If so then use it as the collection directory.
2.  Check if ./name.text is a file,
     If so then use ./ as the collection directory.
3.  Check if mgdir/name is a directory,
     If so then use mgdir/name as the collection directory.
4.  Otherwise,
     Use mgdir/name as the database file prefix.
     This would be the case if one used "-f alice/alice".
     However, one would then not specify a final name argument
     and we'd never get here. Go figure ???
SOLUTION
Moved step 3 to the top instead.
FILES
 *   mgquery.c [search_for_collection()]
*************************************************************
TITLE
  Printout of query terms
APPLICATION
  mg-1, mg-2
TYPE
  extend
REPORT
  tes@kbs.citri.edu.au - April 95
FIX
  tes@kbs.citri.edu.au - April 95
CLAIM
  No easy way to find out the parsed and stemmed words
  used in the query. Would like to know these words
  so I can call a separate highlighting program to 
  highlight these words.
PROBLEM
  No facility available.
SOLUTION
  A ".queryterms" mgquery command was added which lists
  out the parsed/stemmed queryterms of the last query.
FILES
 *    commands.c (added CmdQueryTerms)
*************************************************************
TITLE
  mg_getrc  
APPLICATION
  mg-1, mg-2
TYPE
  extend
REPORT
  bruce@cosc.canterbury.ac.nz - 2/May/95
FIX
   - 
CLAIM
  Repeated code had to be written for different named
  gets but really the same type of parsing required.
  E.g. one might want to use a standard method for inserting
  ^Bs between paragraphs for different books. One doesn't
  want to write duplicate code for each different named book,
  rather note that each book should be filtered "book" style.
PROBLEM
  There was no way of abstracting out types of filters from
  the name of an instance of a collection.
SOLUTION
  Allow information to be given with <name, type, files>.
  This extra info can be provided in a mg_getrc file.
  See man page for mg_get for details.
FILES
 *    mg_get.sh
*************************************************************
TITLE
  Boolean optimiser #1 with `!'  
APPLICATION
  mg-1, mg-2
TYPE
  bug 
REPORT
  triffid@iconz.co.nz - 20/7/95
FIX
  tes@kbs.citri.edu.au - 27/7/95
CLAIM
  Complained about not-nodes. 
  e.g. complained about "croquet & !hedgehog"
PROBLEM
  Boolean optimiser type#1 didn't convert
  "and not"s into diff nodes.
SOLUTION
  Added code to convert '&!' to '-'.
FILES
 *   mg/bool_optimiser.c [mg-1]
 *   query/bool_optimiser.c [mg-2]
*************************************************************
TITLE
  Consistent use of stderr
APPLICATION
  mg-1
TYPE
  improve
REPORT
  beebe@math.utah.edu - 16 May 1994
FIX
  tes@kbs.citri.edu.au - 11 August 1994
CLAIM
  Inconsistent use of stdout/stderr in usage messages.
PROBLEM
  Sometimes used "printf" and sometimes used "fprintf(stderr"
  in usage messages.
SOLUTION
  All should now use "fprintf(stderr" in usage messages.
FILES
 *   mg_compression_dict.c
 *   mg_compression_dict.1
 *   mg_fast_comp_dict.c
 *   mg_fast_comp_dict.1
 *   mg_invf_dict.c
 *   mg_invf_dict.1
 *   mg_invf_dump.c
 *   mg_invf_dump.1
 *   mg_invf_rebuild.c
 *   mg_invf_rebuild.1
 *   mg_perf_hash_build.c
 *   mg_perf_hash_build.1
 *   mg_text_estimate.c
 *   mg_text_estimate.1
 *   mg_weights_build.c
 *   mg_weights_build.1
*************************************************************
TITLE
  xmg bug
APPLICATION
  mg-1
TYPE
  bug
REPORT
  cgn@cpsc.ucalgary.ca - 22 April 1994
FIX
  cgn@cpsc.ucalgary.ca - 22 April 1994
CLAIM
  "Serious problem in xmg, which I fear occurs whenever a query
   doesn't return anything."
PROBLEM
  ??
SOLUTION
  [xmg.sh 201] set rank 0
FILES
 *   xmg.sh
*************************************************************
TITLE
  Unnecessary loading of text
APPLICATION
  mg-1
TYPE
  bug
REPORT
  tes@kbs.citri.edu.au - ?? August 1994
FIX
  tes@kbs.citri.edu.au - 12 August 1994
CLAIM
  Mg was loading and uncompressing text when the
  query did not require the text.
PROBLEM
  There was no test for the query mode
  before loading and uncompressing the text.
SOLUTION
  Only load/uncompress text if query mode
  is for text, headers or silent(for timing).
FILES
 *   mgquery.c
*************************************************************
TITLE
  Man page errors
APPLICATION
  mg-1
TYPE
  bug
REPORT
  beebe@math.utah.edu - 16 May 1994
FIX
  beebe@math.utah.edu - 16 May 1994
CLAIM
  Man page errors.
PROBLEM
  See below.
SOLUTION
 "The mg_make_fast_dict.1 file has been renamed mg_fast_comp_dict.1,
  and all mg_make_fast_dict strings changed to mg_fast_comp_dict in all
  man pages.
  A large number of errors of spelling, typography, spacing, fonts,
  grammar, omitted words, slang, punctuation, missing man page
  cross-references, and man-page style have been corrected."
FILES
 *    mg_compression_dict.1
 *    mg_fast_comp_dict.1
 *    mg_get.1
 *    mg_invf_dict.1
 *    mg_invf_dump.1
 *    mg_invf_rebuild.1
 *    mg_passes.1
 *    mg_perf_hash_build.1
 *    mg_text_estimate.1
 *    mg_weights_build.1
 *    mgbilevel.1
 *    mgbuild.1
 *    mgdictlist.1
 *    mgfelics.1
 *    mgquery.1
 *    mgstat.1
 *    mgtic.1
 *    mgticbuild.1
 *    mgticdump.1
 *    mgticprune.1
 *    mgticstat.1
 *    xmg.1
*************************************************************
TITLE
  Man page overview
APPLICATION
  mg-1
TYPE
  extend
REPORT
  beebe@math.utah.edu - 
FIX
  tes@kbs.citri.edu.au - 17 August 1994
CLAIM
  "Write new mg.1 file to give a brief overview of mg, with samples
   of how to use it.  Otherwise, users are likely to be completely
   overwhelmed by the number of programs (about 20) which might need to
   be used, when in reality, only 2 or 3 are likely to be run by end
   users."
SOLUTION
  It was thought that mg.1, written by Nelson Beebe, was very useful
  but a bit too comprehensive for an introduction.
  Therefore, two man files, mgintro.1 and mgintro++.1 were written
  with the basic stuff in mgintro.1 and slightly more advanced stuff
  in mgintro++.1 . 
FILES
 *   mg.1
 *   mgintro.1
 *   mgintro++.1 
*************************************************************
TITLE
  Parse errors not bus errors
APPLICATION
  mg-1
TYPE
  bug
REPORT
  tim@cosc.canterbury.ac.nz - 2 Jun 94
FIX
  tes@kbs.citri.edu.au - 19 Aug 94
CLAIM
  "These two queries 
   (which I typed in before I knew what I was doing!!)
      > The Queen of Hearts, she made some tarts
      > "The Queen of Hearts" and "she made some tarts"
      produced the following result:
      mgquery : parse error
      Bus error
  "
PROBLEM
  What is expected to happen under boolean querying:
    Query1:
    > The Queen of Hearts, she made some tarts
    will produce a parse error due to the comma which
    is not a valid TERM.
    Query2:
    > "The Queen of Hearts" and "she made some tarts"
    will store a post-processing string
    of ''The Queen of Hearts" and "she made some tarts'' and
    will have a main boolean query of the empty string.
    This is because the postprocessing string takes in
    everything between the first quote and the last one.
    An empty string is illegal for the boolean grammar and
    hence a parse error.
  The problem stems from the fact that the processing of
  the parse tree is carried out, even though we have a 
  parse error. In the case of using an empty string to build
  a parse tree, it is likely to leave the parse tree undefined.
SOLUTION
  As soon as we find out that there is a parse-error,
  we abandon any processing of the parse tree.
FILES
 *   query.bool.y
 *   query.bool.c (generated from query.bool.y)
*************************************************************
TITLE
  Perfect hashing on small vocab
APPLICATION
  mg-1
TYPE
  bug
REPORT
  tes@kbs.citri.edu.au - July 1994
FIX
  alistair@cs.mu.oz.au - July 1994
CLAIM
  Mg could not handle small collections in the case
  where there was only a small number of unique words.
  The perfect hash function would report an error.
PROBLEM
  Rounding of the arithmetic during the calculation of the
  parameters of the perfect hash function was resulting in a
  combination of values such that the probability of a hash
  function being found was very small. This led to the limit
  on the generation loop being exceeded, and eventual
  failure.
SOLUTION
  By using ceiling rather than floor when converting from a
  floating point value to an integer parameter, the arithmetic
  is now correct for all lexicon sizes, and the probability of
  each iteration successfully generating a hash function is
  sufficiently great that with _very_ high probability the
  execution loop counter will not be exceeded unless there
  genuinely is no hash function (for example, if the lexicon
  contains two words the same there cannot be  a hash
  function).
FILES
 *   perf_hash.c
*************************************************************