CF Basics
Creating Dynamic Websites With ColdFusion
The CF Apprentice Series - Part 4: Verity Free Text Search

By: Michael Smith
Sep. 16, 2003 12:00 AM

In this article we continue to look at what ColdFusion is and how you can use it for dynamic website creation. We cover free text searching of both multiple files and databases using the ColdFusion Verity text search engine. Free text searching lets you look for words anywhere in a directory structure or database.

What is ColdFusion?
In case you missed the last article that introduced ColdFusion, let me explain what it is. ColdFusion, which was introduced by Allaire in 1995 and is currently on version 4.5.1, is a programming language based on standard HTML (Hyper Text Markup Language) that is used to write dynamic webpages. It lets you create pages on the fly that differ depending on user input, database lookups, time of day or any other criteria you dream up. ColdFusion pages consist of standard HTML tags such as <FONT SIZE="+2"> together with CFML (ColdFusion Markup Language) tags such as <CFQUERY>, <CFIF> and <CFLOOP>.

Text searching in ColdFusion
Free text searching is a very powerful programming tool that lets you search thousands of files or database records for any text anywhere within them. ColdFusion implements text searching with Verity using the <CFSEARCH> and <CFINDEX> tags. The search language allows for:

  • Wildcards – regular expression style use of ?, *, [], -, ^
  • Evidence operators – STEM, WILDCARD, WORD
  • Proximity operators – NEAR, PARAGRAPH, PHRASE, SENTENCE
  • Relational operators – CONTAINS, MATCHES, STARTS, ENDS, SUBSTRING
  • Concept operators – AND, OR, ACCRUE
  • Score operators – YES, NO, PRODUCT, SUM, COMPLEMENT
You could write hundreds of lines of code to do these kinds of searches yourself, but it would run orders of magnitude slower than using the single <CFSEARCH> tag. This is because Verity creates a word lookup index (or collection) of every piece of text in your files or records so that it can go straight to the ones you are searching for. This is analogous to an index in a book that lists all the pages that a certain word appears on. If you imagine how tedious it would be to search for words in a book without an index, it will give you an idea of the advantages Verity can give your ColdFusion programs.

The Verity Engine
The free text indexing and searching functionality in ColdFusion is based on Verity, Inc.'s SEARCH'97 product. Indexing data is available both through the <CFINDEX> tag and the ColdFusion Administrator, where you can create and manage collections. Searching is done using the <CFSEARCH> tag. Output of search results to your pages is done using the same <CFOUTPUT> tag that you would use with database queries.

The Verity engine performs searches against collections. Collections consist of an index of all the words in all the files or records you want to search. Collection information includes:

  • Word indexes.
  • An internal documents table.
  • Logical pointers to actual document files.
In your ColdFusion application, you can populate and search multiple collections, each of which can be designed to focus on a specific group of documents or queries, according to subject, document type, location, or any other logical grouping. Searches can be performed against multiple collections, giving you lots of flexibility in designing your search interface.

The <CFINDEX> tag lets you manage the data in an existing collection, including:

  • Indexing text or binary data in specified directories, or indexing ColdFusion queries.
  • Purging a collection of data.
  • Updating, refreshing, and optimizing a collection.
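
As a rough illustration of two of the maintenance actions listed above, the sketch below shows a purge and an optimize. The collection name KnowledgeBase is an assumption made for the example; substitute the name of a collection you have registered. The two tags are independent of each other, so use whichever you need.

```cfml
<!--- Assumes a collection named "KnowledgeBase" (hypothetical name)
      has already been created in the ColdFusion Administrator. --->

<!--- PURGE: delete all indexed data but keep the collection itself,
      ready to be populated again --->
<CFINDEX COLLECTION="KnowledgeBase" ACTION="PURGE">

<!--- OPTIMIZE: reorganize the collection's index for faster searching --->
<CFINDEX COLLECTION="KnowledgeBase" ACTION="OPTIMIZE">
```
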
Creating a Verity collection
However, before you can perform any of these operations using <CFINDEX>, you need to create the collection in the ColdFusion Administrator. This is somewhat similar to how you have to create a datasource for SQL queries in the Administrator. Here are the steps for creating a collection:
  1. Open the ColdFusion Administrator Verity page.
  2. Enter a name for your collection. The Administrator fills in the Collection Root path with a corresponding directory path.
  3. Click Create. The new collection name and path appear in the Verity Collections List.

 
Figure 1 - Creating a Verity Collection in the ColdFusion Administrator

Once your collection is created, you can use either the Administrator or the <CFINDEX> tag to populate it with documents to search. Generally I use the Administrator for static data and the <CFINDEX> tag for data that changes and must be frequently re-indexed.

Here are some ideas on using Verity in your applications:

  • Index your Web site and provide a generalized search mechanism, such as a form interface, for executing searches.
  • Index specific directories containing ASCII documents for subject-based searching.
  • Index ColdFusion queries, giving your end-users the ability to perform custom queries against data you've indexed. Since collections are made up of data optimized for retrieval, this method is much faster than performing multiple database queries to return the same data.
  • Manage and search collections generated outside of ColdFusion using native Verity tools. This additional capability requires only that the full path to the collection be specified in the index command.
  • Index e-mail generated by ColdFusion application pages and create a searching mechanism for the indexed messages.
  • Build collections of inventory data and make those collections available for searching from your ColdFusion application pages.
  • Support international users in a range of languages from both the <CFINDEX> and <CFSEARCH> tags.
Indexing documents
ColdFusion allows you to index and search collections populated with data from:
  • ASCII text files.
  • Binary Office documents (see below for details about document types).
  • ColdFusion queries resulting from data returned by a <CFQUERY> operation.

You can index libraries of HTML and CFML documents and other ASCII text files. Choose specific documents or an entire directory tree as the target of your collection. Collections can be stored anywhere, so you have a lot of flexibility in accessing indexed data. This adds enormous value to any content-rich Web site.

For example, at TeraTech we are always coming across useful e-mails, documents, code snippets, web pages and newsgroup references. We never knew how to store these effectively for future reference. Paper printouts were hard to search and share in a team, and our existing computer copies were not much better. So we came up with a simple knowledgebase by creating a straightforward directory-based system that can be searched by Verity. (It also has the added advantage of being very easy to save documents to. If you make it too hard to save documents for reference, there will be no documents to search, and a knowledgebase is useless if no one uses it!) This is why we prefer saving text documents to a simple directory system, instead of trying to be sophisticated and saving them in a database.

Whenever we find a document with some reference value (in e-mail, newsgroups, or on the Web), we save it to the knowledgebase directory on our shared X: drive. It is useful to give the file a long, descriptive name, since this will basically be the title of the document when search results are returned. We have found that Eudora conveniently saves e-mail messages with a file name based on the subject of the message!

The ColdFusion code to create the Verity collection for our knowledge base of documents is:

View the code for create_collection.cfm

Here we are refreshing a collection named KnowledgeBase that is stored in the directory X:\knowledgebase\. The recurse parameter tells Verity to index all subdirectories too. The extensions parameter lists the file types to index.
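
A minimal sketch of such a refresh, based only on the description above, might look like the following. The EXTENSIONS list and the URLPATH value are assumptions chosen for illustration; they are not taken from the referenced listing.

```cfml
<!--- Re-index every matching file under X:\knowledgebase\.
      TYPE="PATH" tells CFINDEX that KEY is a directory to index.
      URLPATH and the extension list are hypothetical values. --->
<CFINDEX
	COLLECTION="KnowledgeBase"
	ACTION="REFRESH"
	TYPE="PATH"
	KEY="X:\knowledgebase\"
	URLPATH="/knowledgebase/"
	EXTENSIONS=".htm, .html, .cfm, .txt, .doc"
	RECURSE="YES">
```
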

Note: if X: is not a physical drive on the ColdFusion server, you may have to refer to it by a UNC (Universal Naming Convention) path such as \\mswebserver\x-drive. This is because by default the ColdFusion process runs without logging into the machine, and so it doesn't see mapped drive letters such as X:.

The knowledgebase directory is broken down into areas of common developer interest, such as JavaScript, ColdFusion, ASP, Access97, VB, HTML, etc. New directories can be added as needed. The directories are not really necessary as far as Verity is concerned, but are useful to prevent information scramble/overload (and in case we ever want to do any clean-up of the data).

When there are many documents, the <CFINDEX> tag can take some time to run (on our site it takes 45 seconds on average for 1000 documents). To avoid user delays and still keep the collection up to date as new documents are saved, we use the ColdFusion scheduler to automatically run the above refresh action at 6 am every day. A <CFMAIL> tag e-mails me to confirm that the command ran OK.

View the code for collection_timer_advise.cfm
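
A scheduled page along these lines could refresh the collection and then send a confirmation. This is a sketch under assumptions: the e-mail addresses, the extension list and the elapsed-time report are hypothetical and are not taken from the referenced listing.

```cfml
<!--- Record the start time so the e-mail can report how long indexing took --->
<CFSET StartTime = Now()>

<!--- Same refresh as described above; the extension list is an assumption --->
<CFINDEX
	COLLECTION="KnowledgeBase"
	ACTION="REFRESH"
	TYPE="PATH"
	KEY="X:\knowledgebase\"
	EXTENSIONS=".htm, .html, .txt, .doc"
	RECURSE="YES">

<!--- Whole seconds between the start time and now --->
<CFSET Elapsed = DateDiff("s", StartTime, Now())>

<!--- Addresses are placeholders (assumption) --->
<CFMAIL
	TO="webmaster@example.com"
	FROM="scheduler@example.com"
	SUBJECT="KnowledgeBase collection refreshed">
The KnowledgeBase collection was refreshed at #TimeFormat(Now(), "HH:mm:ss")# on #DateFormat(Now(), "mm/dd/yyyy")#.
Indexing took #Elapsed# seconds.
</CFMAIL>
```

The page itself contains no scheduling logic; it would be registered as a daily task in the ColdFusion scheduler.
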

Indexing queries
In addition to indexing documents, Verity can index your output from a <CFQUERY>. Of course you could search the text directly in SQL using the LIKE operator or the INSTR() function, but both of these methods use full table scans and so are slow on any but the smallest databases. Another advantage is that the search interface is simple both for the user and for you as the developer, as typically you have one input field that is searched through all fields in the database.

Document types supported

Verity supports a wide array of binary document types. This means you can index word processing, spreadsheet, and other document types and produce search results that include summaries of these documents.

The following document types are supported:

  • ASCII text
  • Adobe Acrobat, PDF
  • Ami Pro
  • WordPerfect
  • Word, RTF
  • Excel
  • PowerPoint

Verity also supports foreign language indexing using the ColdFusion International Language Search Pack in: German, French, Danish, Dutch, Finnish, Italian, Norwegian, Portuguese, Spanish, and Swedish.

To index a ColdFusion query:

  1. Define a logical name and location for your collection using the ColdFusion Administrator Verity page.
  2. Execute a <CFQUERY> to retrieve data from the desired ODBC data source.
  3. Generate the collection using the <CFINDEX> tag.

The query set is indexed using the <CFINDEX> tag in which you specify a KEY, typically a unique value like the primary key, and the column in which you want to conduct searches, the BODY. In our example we have a database of e-mail messages to query from.

View the code for index_queryset.cfm

This <CFINDEX> statement specifies the Body column as the core of the collection and names the KEY as the Message_ID column, the table's primary key. Note that the TITLE attribute names the UserName column from the Messages table. The TITLE attribute can be used to designate an output parameter when you are displaying your Verity search results.
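
Based on the column names given above, the query and index step could be sketched as follows. The datasource name MailDB and the collection name MessageBase are hypothetical; the collection is assumed to exist already in the Administrator.

```cfml
<!--- "MailDB" is a hypothetical ODBC datasource name --->
<CFQUERY NAME="Messages" DATASOURCE="MailDB">
	SELECT Message_ID, UserName, Body
	FROM Messages
</CFQUERY>

<!--- TYPE="CUSTOM" indexes a query result rather than files.
      "MessageBase" is a hypothetical collection name. --->
<CFINDEX
	COLLECTION="MessageBase"
	ACTION="UPDATE"
	TYPE="CUSTOM"
	QUERY="Messages"
	KEY="Message_ID"
	TITLE="UserName"
	BODY="Body">
```
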

View the code for title_attribute.cfm
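
One way the TITLE value might be shown in search results is sketched below. The collection name MessageBase and the URL parameter SearchText are assumptions for the example, not details from the referenced listing.

```cfml
<!--- Search the hypothetical "MessageBase" collection --->
<CFSEARCH
	COLLECTION="MessageBase"
	NAME="Hits"
	TYPE="SIMPLE"
	CRITERIA="#URL.SearchText#">

<CFOUTPUT QUERY="Hits">
	<!--- Title holds the indexed UserName value; Key holds the Message_ID --->
	#HTMLEditFormat(Title)# (message #Key#, score #Score#)<BR>
</CFOUTPUT>
```
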

We will explain in detail how to search the collection below.

To index more than one column in a collection, enter a comma-separated list of column names for values of the BODY attribute, such as: BODY=FirstName,LastName,Company

As an alternative, you can use the concatenation function of your DBMS in a SELECT statement, such as: SELECT FIRSTNAME+' '+LASTNAME AS WHOLENAME.

A space is inserted between each concatenated value to avoid mixing up words. You would then generate a collection from WHOLENAME.
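
The concatenation approach could be sketched like this. The datasource, table, key column and collection names are all hypothetical, and the + operator is only one DBMS's concatenation syntax (some databases use || instead).

```cfml
<!--- "ContactsDB", "People" and "Person_ID" are hypothetical names.
      The + concatenation operator is DBMS-specific. --->
<CFQUERY NAME="People" DATASOURCE="ContactsDB">
	SELECT Person_ID, FIRSTNAME + ' ' + LASTNAME AS WHOLENAME
	FROM People
</CFQUERY>

<!--- Index the combined column; "PeopleBase" is a hypothetical collection --->
<CFINDEX
	COLLECTION="PeopleBase"
	ACTION="UPDATE"
	TYPE="CUSTOM"
	QUERY="People"
	KEY="Person_ID"
	TITLE="WHOLENAME"
	BODY="WHOLENAME">
```
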

    Searching a Verity collection
    The <CFSEARCH> tag lets you search one or more Verity collections. Searches can be for single words, for multiple words, or can use proximity operators, for example to require that terms appear within three words of each other or in the same sentence.

    In our file based KnowledgeBase example:

     
    <CFSEARCH
    	COLLECTION="KnowledgeBase"
    	NAME="Articles"
    	TYPE="SIMPLE"
    	CRITERIA="#URL.SearchText#">
    
    We are searching the collection called KnowledgeBase with a simple word search for words contained in the URL parameter SearchText. This parameter has been passed on the URL string to our search results page. The list of files matching the search is returned in the query named Articles.

    To display the search results a pageful at a time we use the <CFOUTPUT> tag with the startrow and maxrows parameters. These would be set using paging buttons on the results page, which to save space we have not shown here. We use a table format to make the display easier to read.

    View the code for search_results.cfm

    In the output we use the standard <CFSEARCH> output columns score, url, key and summary (see below). We also use the URLEncodedFormat function in case the file name contains spaces, and we add the file name on the end of the URL a second time with spaces stripped, so that if the file is downloaded it will be saved with the stripped name. For example "My Test.doc" would have URL My%20Test%2Edoc/MyTest.doc and if you clicked on the link the file name would be MyTest.doc. The target="_new" parameter of the HTML <A HREF> tag tells the browser to use a new window when you click on the link. We use the HTMLEditFormat function on the summary variable because any HTML it contains could break our display; the function converts the HTML codes to displayable text.
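
A results page following this description might be sketched as below. It is an illustration only: the /knowledgebase/ web path, the page size of 10 and the StartRow URL parameter are assumptions, and the link is built from the key column on the assumption that all files sit directly under that web path.

```cfml
<!--- Paging parameter; the paging buttons that set StartRow are not shown --->
<CFPARAM NAME="URL.StartRow" DEFAULT="1">
<CFSET PageSize = 10>

<TABLE BORDER="1" CELLPADDING="3">
<TR><TH>Score</TH><TH>Document</TH><TH>Summary</TH></TR>
<CFOUTPUT QUERY="Articles" STARTROW="#URL.StartRow#" MAXROWS="#PageSize#">
	<!--- Key holds the full file path; keep only the file name --->
	<CFSET FileName = GetFileFromPath(Key)>
	<!--- Same name with spaces removed, used as the download name --->
	<CFSET SaveName = Replace(FileName, " ", "", "ALL")>
	<TR>
		<TD>#Score#</TD>
		<!--- "/knowledgebase/" is a hypothetical web path to the documents --->
		<TD><A HREF="/knowledgebase/#URLEncodedFormat(FileName)#/#SaveName#" TARGET="_new">#HTMLEditFormat(FileName)#</A></TD>
		<!--- Escape any HTML in the summary so it displays as text --->
		<TD>#HTMLEditFormat(Summary)#</TD>
	</TR>
</CFOUTPUT>
</TABLE>
```
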

    A full list of Verity variables is:

    • KEY – the value of the KEY attribute defined in the CFINDEX tag used to populate the collection. In our case the filename and path.
    • TITLE – Returns the value of the TITLE attribute defined by the <TITLE> HTML tag in any HTML or ColdFusion application page file that was indexed by CFINDEX. If the collection was TYPE=CUSTOM, TITLE returns the value of the TITLE attribute defined by the CFINDEX tag. If the collection was TYPE=FILE, TITLE also returns the value of the TITLE attribute defined by the CFINDEX tag.
    • SCORE – Returns the relevancy score of the document based on the search criteria from 0 to 100.
    • URL – Returns the value of the URLPATH attribute defined in the CFINDEX tag used to populate the collection.
    • SUMMARY – Returns the best three matching sentences of the document, up to a maximum of 500 characters.
    • CUSTOM1, CUSTOM2 – Return the user-defined key fields set in the CFINDEX tag.
    • RECORDCOUNT – The total number of records returned by the query
    • CURRENTROW – The current row of the query being processed by CFOUTPUT
    • RECORDSSEARCHED – The total number of records in the index that were searched
     
    Figure 2: Verity search results page

    Verity Search Query Language
    You can do more than just search for single words using the <CFSEARCH> CRITERIA parameter. You can also enter comma-delimited strings and use wildcard characters (regular expressions). By default, a simple query searches for words, not strings. For example, entering the word "all" will find documents containing the word "all" but not "allegorical." You can use wildcards, however, to broaden the scope of the search: "all*" will return documents containing "all" as well as documents containing "alliterate." Case is ignored, but only when (as above) the search string is all lowercase or all uppercase. If the criteria is mixed case ("All"), only the same case will match (only "All", not "all" or "ALL").

    Testing and Debugging Applications
    As you build your ColdFusion application pages, you can test pages by simply opening them in a browser. There is no need to compile or link your pages. You can make a tiny change and see the results of your change immediately by simply opening the page in your browser. Most ColdFusion developers run ColdFusion and a Web server locally, on their own computers, and test applications by editing and viewing or running pages side-by-side. Once your application is ready, you can very easily deploy your pages to a remote server.

    You can enter multiple words separated by commas: software, Microsoft, Oracle. The comma in a simple query expression is treated like a logical OR. If you omit the commas, the query expression is treated as a phrase, so documents would be searched for the phrase "software Microsoft Oracle."

    You can use the AND, OR, and NOT operators in a simple query: software AND (Microsoft OR Oracle). To include an operator in a search, you surround it with double quotation marks: software "and" Microsoft. This expression searches for the phrase "software and Microsoft."

    A simple query employs the STEM operator and the MANY modifier. STEM searches for words that derive from those entered in the query expression, so that entering "find" will return documents that contain "find," "finding," "finds," etc. The MANY modifier forces the documents returned in the search to be presented in a list based on a relevancy score.
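
The query expressions described in this section can be passed straight to the CRITERIA attribute. The sketch below reuses the KnowledgeBase collection from earlier; the result names AnyWord, Grouped and Prefix are made up for the example.

```cfml
<!--- Comma-separated words: the comma is treated like a logical OR --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="AnyWord" TYPE="SIMPLE"
	CRITERIA="software, Microsoft, Oracle">

<!--- Explicit operators with grouping --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="Grouped" TYPE="SIMPLE"
	CRITERIA="software AND (Microsoft OR Oracle)">

<!--- Wildcard: matches "all" and any word starting with "all" --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="Prefix" TYPE="SIMPLE"
	CRITERIA="all*">

<!--- Show how many documents each expression matched --->
<CFOUTPUT>
Any word: #AnyWord.RecordCount#<BR>
Grouped: #Grouped.RecordCount#<BR>
Prefix: #Prefix.RecordCount#<BR>
</CFOUTPUT>
```
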

    For a full list of Verity operators, see the online help on our Knowledge Base page at www.teratech.com/knowledgebase/. You can also try out our Verity Knowledge Base there!

    Summary
    In this article we learned how to index both documents and large database queries for free text searches using Verity. We used the <CFINDEX> and <CFSEARCH> tags together with a <CFOUTPUT> to display results.

    Creating Dynamic Websites With ColdFusion
        — The CF Apprentice Series

    Part 1: What is ColdFusion?
    Part 2: Loops and Lists
    Part 3: Dynamic E-Mail

  • Published Sep. 16, 2003— Reads 6,150
    Copyright © 2003 SYS-CON Media, Inc. — All Rights Reserved.
    Syndicated stories and blog feeds, all rights reserved by the author.
    About Michael Smith
    Michael Smith is president of TeraTech (www.teratech.com/), an
    11-year-old Rockville, Maryland-based consulting company that
    specializes in ColdFusion, database, and Visual Basic development.
