tokenizer Class Reference

Primitive parser with state.

#include <tokenizer.h>

Public Member Functions

void reset ()
 Resets token iteration.
boolc operator! () const
 Can the stream be read?
string & operator* ()
 Modify the current token.
stringc & operator() () const
 Access the current token in the stream.
void operator++ ()
 Increment the stream's index.
void atomize (stringc &atom)
 Splits strings.
void subtract (stringc &atom)
 Splits strings removing the atom.
void tokenize ()
 
delimiter set.
void stripcomment (stringc &comment)
 For each line, delete the comment marker and everything to its right, if the marker occurs in the line.
void remove (stringc &token)
 Remove matching tokens.
void remove_if ()
 Remove invalid lines as interpreted by myspacer.
template<typename SPACER >
void remove_if (SPACER spacer)
 Remove invalid lines.
template<typename X >
void apply (X x)
 Apply the functional object x to each element.
void trim ()
 Trim the leading and trailing space of each token.
void trim_and_prune ()
 Remove surrounding white space and delete empty strings.
void extractfromcurrent (vector< string > &v, stringc &atom) const
 The current line is split by the atom, and each resulting piece is written to the vector v.
ostream & print (ostream &os) const
 Print seq.
void read (stringc &data)
 Read the data in.
void readaslines (stringc &data)
 Read the string as lines.
void readaslinesgeneral (stringc &data)
 Because special characters in strings can interfere with reading strings, this routine is a hack to remove white spaces, commas and empty lines.
 tokenizer ()
 Construct in uninitialized state.
 tokenizer (stringc &data)
 Construct with an initial string.
boolc operator== (tokenizer &t2)
 Compare tokenizers by comparing each token.
boolc find (string::size_type &k, stringc &atom)
 Non destructive text search.
boolc find (string::size_type &k, stringc &atom, string::size_type const k0)
 Search from k0 in the current string.
boolc atomize_next (stringc &atom)
 Search for the first atom and atomize it.
boolc atomize_next (stringc &atom, liststringi &iend_)
boolc atomize_next_tag (liststringi &i1, liststringi &i2, stringc &tag, liststringi &iend_)
 If tag=cat, then i1 points to <cat> and i2 to </cat>.
boolc atomize_next_tag (liststringi &i1, liststringi &i2, stringc &tag)
boolc atomize_next (liststringi &i1, liststringi &i2, stringc &atom1, stringc &atom2)
 operator stringc ()
 Output the tokenizer back as a string.

Public Attributes

liststringi current
 The current state.
list< string > seq
 Data representation.
string printdelimiter
 Each element printed is separated by the delimiter.


Detailed Description

Primitive parser with state.

Generally concerned with global scope.

This class has two roles. First, it is a primitive parser. Second, it holds state, so another class can use it to track the current token.

Here a stream means the list of string tokens together with a current position in the list.

Use: read in the file, then call atomize to separate the sequence into individual elements that an interpreter can process.

The data can also be operated on directly through the list class (the public member seq).

Example 1. Remove comments and delete any empty strings.

  tokenizer ss;
  ...
  ss.stripcomment("//");
  ss.seq.remove("");
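
As a standalone illustration of Example 1, the following sketch performs the same two steps on a plain std::list<std::string> rather than on a tokenizer. The names strip_comments and strip_and_prune are hypothetical; they only mirror what the stripcomment listing on this page does and are not part of tokenizer.h. The guard against an empty comment marker is an addition of the sketch.

```cpp
#include <list>
#include <string>

// Hypothetical stand-in for tokenizer::stripcomment: in every line, erase
// everything from the first occurrence of `comment` to the end of the line.
void strip_comments(std::list<std::string>& seq, const std::string& comment)
{
    if (comment.empty())
        return; // an empty marker would match at position 0 of every line
    for (std::list<std::string>::iterator k = seq.begin(); k != seq.end(); ++k) {
        std::string::size_type i = k->find(comment);
        if (i != std::string::npos)
            k->erase(i);
    }
}

// Example 1 as a function: strip comments, then drop the empty strings.
void strip_and_prune(std::list<std::string>& seq, const std::string& comment)
{
    strip_comments(seq, comment);
    seq.remove(""); // std::list::remove deletes every element equal to ""
}
```

A line that consists only of a comment becomes an empty string after the first step, which is why the second step is needed to get rid of it.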

See vrmlshape where I wrote a VRML parser using this class to hold the current state.

Definition at line 66 of file tokenizer.h.


Constructor & Destructor Documentation

tokenizer::tokenizer (  ) 

Construct in uninitialized state.

Definition at line 435 of file tokenizer.cpp.

References current, and seq.

00436   : printdelimiter("\n")
00437 {
00438   current=seq.end();
00439 }

tokenizer::tokenizer ( stringc & data  ) 

Construct with an initial string.

Definition at line 441 of file tokenizer.cpp.

References reset(), and seq.

00442 {
00443   seq.push_back(data);
00444   reset();
00445 }


Member Function Documentation

template<typename X >
void tokenizer::apply ( X x  )  [inline]

Apply the functional object x to each element.

Definition at line 117 of file tokenizer.h.

References seq.

Referenced by trim().

00118   { 
00119     liststringi i = seq.begin();
00120     liststringic imax = seq.end();
00121     for ( ; i!=imax; ++i )
00122       { x(*i); }
00123   };
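
To show what a functional object passed to apply might look like, here is a minimal sketch written against a plain std::list<std::string>. Both apply_each and trim_blanks are hypothetical names; the functor is only a guess at the kind of object trim() passes in, since that code is not shown on this page.

```cpp
#include <list>
#include <string>

// Same shape as tokenizer::apply, written as a free function over a list.
template <typename X>
void apply_each(std::list<std::string>& seq, X x)
{
    for (std::list<std::string>::iterator i = seq.begin(); i != seq.end(); ++i)
        x(*i);
}

// Hypothetical functor: strips leading and trailing spaces and tabs in place.
struct trim_blanks
{
    void operator()(std::string& s) const
    {
        const char* const blanks = " \t";
        std::string::size_type first = s.find_first_not_of(blanks);
        if (first == std::string::npos) {
            s.clear(); // the string held nothing but blanks
            return;
        }
        std::string::size_type last = s.find_last_not_of(blanks);
        s = s.substr(first, last - first + 1);
    }
};
```

Because the functor receives each element by reference, it can modify the tokens in place; it is passed by value, so any state it accumulates stays in the copy.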

void tokenizer::atomize ( stringc & atom  ) 

Splits strings.

Definition at line 297 of file tokenizer.cpp.

References seq.

00298 {
00299   liststringi i = seq.begin();
00300 
00301   for (;i!=seq.end(); ++i)
00302     atomize(i,atom);
00303 }
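
The per-element overload atomize(i,atom) is not listed on this page, so the following is only a sketch of the assumed effect: every occurrence of the atom ends up as its own list element, and the text around it is kept. The name atomize_all and the handling of an empty atom are assumptions of the sketch.

```cpp
#include <list>
#include <string>

// Assumed behaviour: split every string so that each occurrence of `atom`
// becomes its own element, with the surrounding text kept.
void atomize_all(std::list<std::string>& seq, const std::string& atom)
{
    if (atom.empty())
        return; // nothing sensible to split on
    std::list<std::string>::iterator i = seq.begin();
    while (i != seq.end()) {
        std::string::size_type k = i->find(atom);
        if (k == std::string::npos || *i == atom) {
            ++i; // no atom here, or the element already is the atom
            continue;
        }
        if (k > 0) {
            // Text before the atom: move it into a new element in front of i.
            seq.insert(i, i->substr(0, k));
            i->erase(0, k);
        } else {
            // Atom at the front: emit it as its own element in front of i.
            seq.insert(i, atom);
            i->erase(0, atom.length());
        }
        // i still refers to the unprocessed remainder; loop again without ++i.
    }
}
```

Under these assumptions the single string "a=b=c" atomized by "=" becomes the five elements a, =, b, =, c. std::list::insert does not invalidate existing iterators, which is what lets the loop keep using i after inserting in front of it.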

boolc tokenizer::atomize_next ( liststringi & i1,
liststringi & i2,
stringc & atom1,
stringc & atom2 
)

Definition at line 229 of file tokenizer.cpp.

00235 {
00236   bool res;
00237   
00238   res=atomize_next(atom1);
00239   if (res==false)
00240     return false;
00241   i1=current;
00242 
00243   res=atomize_next(atom2);
00244   if (res==false)
00245     return false;
00246   i2=current;
00247 
00248   // Default to reset iterator to first tag.
00249   current=i1;
00250 
00251   return true;
00252 }

boolc tokenizer::atomize_next ( stringc & atom,
liststringi & iend_ 
)

Definition at line 186 of file tokenizer.cpp.

00190 {
00191   liststringi i = current;
00192 
00193   string::size_type k;
00194   
00195   string::size_type const atomlen = atom.length();
00196 
00197   for (;i!=iend_; ++i)
00198   {
00199     k = i->find(atom.c_str());
00200 
00201     if (k==string::npos)
00202       continue;
00203 
00204     atomize(i,atom,k);
00205     if (i->length()==atomlen)
00206     {
00207       if (atom==*i)
00208       {
00209         current=i;
00210         return true;
00211       }
00212     }
00213 
00214     ++i;
00215     atomize(i,atom,0);
00216     assert(i->length()==atomlen);
00217     current=i;
00218     return true;
00219   }
00220 
00221   return false; 
00222 }

boolc tokenizer::atomize_next ( stringc & atom  ) 

Search for the first atom and atomize it.

On success, current points to it. Essentially this is a find function.

Definition at line 142 of file tokenizer.cpp.

Referenced by misclib_testcode::tokenizertest::test10().

00145 {
00146   liststringi i = current;
00147 
00148   string::size_type k;
00149   
00150   string::size_type const atomlen = atom.length();
00151 
00152   for (;i!=seq.end(); ++i)
00153   {
00154     k = i->find(atom.c_str());
00155 
00156     if (k==string::npos)
00157       continue;
00158 
00159     atomize(i,atom,k);
00160     if (i->length()==atomlen)
00161     {
00162       if (atom==*i)
00163       {
00164         current=i;
00165         return true;
00166       }
00167     }
00168 
00169     ++i;
00170     atomize(i,atom,0);
00171     assert(i->length()==atomlen);
00172     current=i;
00173     return true;
00174   }
00175 
00176   return false; 
00177 }

boolc tokenizer::atomize_next_tag ( liststringi & i1,
liststringi & i2,
stringc & tag 
)

Definition at line 255 of file tokenizer.cpp.

00260 {
00261   liststringi iend_=seq.end();
00262   return atomize_next_tag(i1,i2,tag,iend_);
00263 }

boolc tokenizer::atomize_next_tag ( liststringi & i1,
liststringi & i2,
stringc & tag,
liststringi & iend_ 
)

If tag=cat, then i1 points to <cat> and i2 to </cat>.

Definition at line 266 of file tokenizer.cpp.

Referenced by tokenizerlocal::scope(), and misclib_testcode::tokenizertest::unittest01().

00272 {
00273 //cout << "atomize_next_tag ";
00274   string tag1="<"+tag+">";
00275 //cout << SHOW(tag1) << " ";
00276   bool res;
00277 
00278   res=atomize_next(tag1,iend_);
00279   if (res==false)
00280     return false;
00281   i1=current;
00282 
00283   string tag2="</"+tag+">";
00284 //cout << SHOW(tag2) << " ";
00285   res=atomize_next(tag2,iend_);
00286   if (res==false)
00287     return false;
00288   i2=current;
00289 
00290   // Default to reset iterator to first tag.
00291   current=i1;
00292 // cout << SHOW(*current) << " 1" << endl;
00293 
00294   return true;
00295 }
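
The pairing logic can be sketched separately from the splitting. The following hypothetical helper, find_tag_pair, assumes the list has already been atomized so that each tag is its own element, and then only locates the opening and closing tags. Unlike the listing, it does not touch any current position and leaves i1 and i2 unchanged on failure; both are choices of the sketch.

```cpp
#include <algorithm>
#include <list>
#include <string>

typedef std::list<std::string>::iterator token_iter;

// Hypothetical helper: starting at `from`, look for the element "<tag>" and
// then, at or after it, the element "</tag>". On success i1 and i2 point at
// them. Assumes each tag already is an element of its own.
bool find_tag_pair(token_iter from, token_iter end, const std::string& tag,
                   token_iter& i1, token_iter& i2)
{
    token_iter open = std::find(from, end, "<" + tag + ">");
    if (open == end)
        return false;
    token_iter close = std::find(open, end, "</" + tag + ">");
    if (close == end)
        return false;
    i1 = open;
    i2 = close;
    return true;
}
```

The elements strictly between i1 and i2 are then the content of the tag, which a caller could walk with an ordinary iterator loop.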

void tokenizer::extractfromcurrent ( vector< string > &  v,
stringc & atom 
) const

The current line is split by the atom, and each resulting piece is written to the vector v.

Definition at line 306 of file tokenizer.cpp.

00310 {
00311   v.clear();
00312 
00313   string s(*current);
00314 
00315   if (s.empty())
00316     return;
00317 
00318   string::size_type k;
00319 
00320   string::size_type const atomlen = atom.length();
00321 
00322   k = s.find(atom.c_str());
00323   for ( ; k!=string::npos; k=s.find(atom.c_str()) )
00324   {
00325     if (k==0)
00326     {
00327       s.erase(k,atomlen);
00328 
00329       continue;
00330     }
00331 
00332     v.push_back(s.substr(0,k));
00333     s.erase(0,k);
00334   }
00335 
00336   if (s.empty()==false)
00337     v.push_back(s);
00338 }
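
The same splitting can be sketched as a free function on a single string. The name split_by_atom is hypothetical. The sketch scans with an index instead of erasing from the front of a copy, which is a change of technique, and it adds a guard for an empty atom; for a non-empty atom it is intended to produce the same pieces as the listing, with atoms and empty pieces skipped.

```cpp
#include <string>
#include <vector>

// Hypothetical free-function version: split `line` by `atom`, skipping the
// atoms themselves and any empty pieces between adjacent atoms.
std::vector<std::string> split_by_atom(const std::string& line,
                                       const std::string& atom)
{
    std::vector<std::string> v;
    if (atom.empty()) {
        // Guard added in this sketch: searching for "" always matches at 0.
        if (!line.empty())
            v.push_back(line);
        return v;
    }
    std::string::size_type start = 0;
    while (start < line.length()) {
        std::string::size_type k = line.find(atom, start);
        if (k == std::string::npos)
            k = line.length(); // no further atom: the rest is the last piece
        if (k > start)
            v.push_back(line.substr(start, k - start));
        start = k + atom.length();
    }
    return v;
}
```

For example, splitting "a,b,,c" by "," gives the three pieces a, b and c; the empty piece between the two adjacent commas is dropped.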

boolc tokenizer::find ( string::size_type &  k,
stringc & atom,
string::size_type const   k0 
)

Search from k0 in the current string.

Definition at line 503 of file tokenizer.cpp.

References assertreturnfalse, and find().

00508 {
00509   string::size_type atomsize=atom.size();
00510   if (atomsize==0)
00511   {
00512     current=seq.end();
00513     return false;
00514   }
00515 
00516   liststringi i = current;
00517   assertreturnfalse(i!=seq.end());
00518 
00519   if (k0+atomsize-1<i->size())
00520   {
00521     k = i->find(atom.c_str(),k0);
00522     if (k!=string::npos)
00523       return true;
00524   }
00525     
00526   // failed to find in current string.
00527   ++current;
00528 
00529   return tokenizer::find(k,atom);
00530 }

boolc tokenizer::find ( string::size_type &  k,
stringc & atom 
)

Non destructive text search.

Definition at line 533 of file tokenizer.cpp.

Referenced by find(), tokenizerfind::operator++(), tokenizerfind::reset(), and misclib_testcode::tokenizertest::unittest01().

00537 {
00538   liststringi i = current;
00539 
00540   for (;i!=seq.end(); ++i)
00541   {
00542     k = i->find(atom.c_str());
00543 
00544     if (k==string::npos)
00545       continue;
00546 
00547     current=i;
00548     return true;
00549   }
00550 
00551   return false;
00552 }
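
A minimal sketch of the same search, written as a free function so that it can be tried without the class. The name find_from and the explicit current parameter are hypothetical; in the class, current is a member.

```cpp
#include <list>
#include <string>

typedef std::list<std::string> token_list;

// Hypothetical free-function version of the two-argument find: scan from
// `current` to the end of `seq` for the first string containing `atom`.
// On success `current` points at that string and `k` is the offset in it.
bool find_from(const token_list& seq, token_list::const_iterator& current,
               std::string::size_type& k, const std::string& atom)
{
    for (token_list::const_iterator i = current; i != seq.end(); ++i) {
        k = i->find(atom);
        if (k != std::string::npos) {
            current = i;
            return true;
        }
    }
    return false; // `current` is left where it was, as in the listing
}
```

The search is non-destructive in the sense that no string is split or erased; only the position and the offset k change.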

tokenizer::operator stringc (  ) 

Output the tokenizer back as a string.

Definition at line 480 of file tokenizer.cpp.

References reset().

00481 {
00482   string s;
00483   tokenizer& tk(*this);
00484   for ( tk.reset(); !tk; ++tk)
00485     { s += tk(); }; 
00486   
00487   return s; 
00488 }

boolc tokenizer::operator! (  )  const

Can the stream be read? Returns true while current has not reached seq.end(), so !tk is true when a token is available.

Definition at line 8 of file tokenizer.cpp.

References current, and seq.

00009 { 
00010   return (current != seq.end()); 
00011 }

stringc & tokenizer::operator() (  )  const

Access the current token in the stream.

Definition at line 19 of file tokenizer.cpp.

References current, and seq.

00020 { 
00021   assert(current != seq.end()); 
00022   return *current; 
00023 }

string & tokenizer::operator* (  ) 

Modify the current token.

Definition at line 13 of file tokenizer.cpp.

References current, and seq.

00014 { 
00015   assert(current != seq.end()); 
00016   return *current; 
00017 } 

void tokenizer::operator++ (  ) 

Increment the stream's index.

Definition at line 427 of file tokenizer.cpp.

References current, and seq.

00428 {
00429   if (current==seq.end())
00430     return;
00431 
00432   ++current;
00433 } 
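
The three stream operators are meant to be used together in a loop of the form for ( tk.reset(); !tk; ++tk ). The following sketch reproduces that idiom with a minimal stand-in class. The class token_stream and the function join_all are hypothetical, and reset() is assumed to move current to seq.begin(), since its listing is not shown on this page.

```cpp
#include <cassert>
#include <list>
#include <string>

// Minimal stand-in for the stream part of tokenizer (hypothetical class).
class token_stream
{
public:
    std::list<std::string> seq;
    std::list<std::string>::iterator current;

    token_stream() { current = seq.end(); }

    // Assumption: reset moves the position back to the first token.
    void reset() { current = seq.begin(); }

    // Note the sense: true while a token can still be read.
    bool operator!() const { return current != seq.end(); }

    const std::string& operator()() const
    {
        assert(current != seq.end());
        return *current;
    }

    void operator++()
    {
        if (current != seq.end())
            ++current;
    }
};

// Concatenate every token using the reset / ! / ++ loop.
std::string join_all(token_stream& tk)
{
    std::string s;
    for (tk.reset(); !tk; ++tk)
        s += tk();
    return s;
}
```

The stand-in is passed by reference on purpose: copying it would copy the iterator as well, and the copy's iterator would still refer to the original list.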

boolc tokenizer::operator== ( tokenizer & t2  ) 

Compare tokenizers by comparing each token.

Both tokenizers are reset before the comparison, so their previous current positions are not preserved.

Definition at line 452 of file tokenizer.cpp.

References reset().

00453 {
00454   reset();
00455   t2.reset();
00456   for ( ;!t2; ++t2 )
00457   {
00458     if (!(*this)==false)
00459       return false; 
00460 
00461     if ( (*this)() != t2() )
00462       return false;
00463 
00464     ++(*this);
00465   }
00466 
00467   if (!(*this))
00468     return false;
00469 
00470   return true;
00471 }

ostream & tokenizer::print ( ostream & os  )  const

Print seq.

Elements are separated by printdelimiter, which can be set to any string.

Definition at line 407 of file tokenizer.cpp.

References printdelimiter, and seq.

Referenced by operator<<().

00408 {
00409   liststringic i = seq.begin();
00410   liststringic iend2 = seq.end();
00411   if (i!=iend2)
00412     os << *i;
00413   ++i;
00414   for ( ; i!=iend2; ++i )
00415   {
00416     os << printdelimiter << *i;
00417   }
00418 
00419   return os;
00420 }

void tokenizer::read ( stringc & data  ) 

Read the data in.

Definition at line 46 of file tokenizer.cpp.

References seq.

Referenced by modulereport::reset().

00047 { 
00048   seq.push_back(data); 
00049 }

void tokenizer::readaslines ( stringc & data  ) 

Read the string as lines.

Each line is a string.

Definition at line 51 of file tokenizer.cpp.

References seq, and subtract().

Referenced by modulelist::buildlist(), projunittest::eval(), html::insert(), readaslinesgeneral(), misclib_testcode::tokenizertest::test05(), misclib_testcode::tokenizertest::test06(), and misclib_testcode::tokenizertest::test08().

00052 { 
00053   seq.push_back(data); subtract("\n"); 
00054 }
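
Since the subtract listing is not shown on this page, the following is only a sketch of the assumed result of readaslines: one element per line, with the newline removed. The name read_as_lines is hypothetical, and keeping blank lines as empty strings is an assumption of the sketch.

```cpp
#include <list>
#include <string>

// Assumed behaviour of readaslines: one element per line, newline removed.
// Blank lines are kept as empty strings in this sketch, and a trailing
// newline therefore yields a trailing empty string.
std::list<std::string> read_as_lines(const std::string& data)
{
    std::list<std::string> seq;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type k = data.find('\n', start);
        if (k == std::string::npos) {
            seq.push_back(data.substr(start)); // last line, no newline after it
            break;
        }
        seq.push_back(data.substr(start, k - start));
        start = k + 1;
    }
    return seq;
}
```

If empty strings are unwanted, they can be removed afterwards, in the same way Example 1 in the detailed description removes them with seq.remove("").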

void tokenizer::readaslinesgeneral ( stringc & data  ) 

Because special characters in strings can interfere with reading strings, this routine is a hack to remove white spaces, commas and empty lines.

Definition at line 392 of file tokenizer.cpp.

References readaslines(), remove_if(), subtract(), and trim().

Referenced by projunittests::eval(), simplexD2tessindexed< PT, PD, INDX >::serializeInverse(), simplexD1tessindexed< PT, PD, INDX >::serializeInverse(), and simplexD1listlinked< VI, INDX >::serializeInverse().

00393 {
00394   readaslines(data);
00395   subtract(",");
00396   subtract(" ");
00397   trim();
00398   remove_if();
00399 }

void tokenizer::remove ( stringc & token  ) 

Remove matching tokens.

Definition at line 25 of file tokenizer.cpp.

References seq.

Referenced by misclib_testcode::tokenizertest::test04().

00026 { 
00027   seq.remove(token); 
00028 }

template<typename SPACER >
void tokenizer::remove_if ( SPACER  spacer  )  [inline]

Remove invalid lines.

Definition at line 112 of file tokenizer.h.

References seq.

00113     { seq.remove_if(spacer); }
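
Any unary predicate on strings can play the role of SPACER. As an illustration, the sketch below removes strings that are empty or contain only spaces and tabs. The predicate is_blank and the free function remove_lines_if are hypothetical; the actual spacerdelete<> used by the no-argument overload is not shown on this page and may apply different rules.

```cpp
#include <list>
#include <string>

// Hypothetical predicate in the role of SPACER: true for strings that are
// empty or contain only spaces and tabs.
struct is_blank
{
    bool operator()(const std::string& s) const
    {
        return s.find_first_not_of(" \t") == std::string::npos;
    }
};

// Same forwarding as the templated remove_if, as a free function.
template <typename SPACER>
void remove_lines_if(std::list<std::string>& seq, SPACER spacer)
{
    seq.remove_if(spacer); // std::list::remove_if erases matching elements
}
```

The relative order of the remaining elements is preserved, so the pruned list still reads as the original text with the blank lines dropped.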

void tokenizer::remove_if (  ) 

Remove invalid lines as interpreted by myspacer.

Definition at line 30 of file tokenizer.cpp.

References seq.

Referenced by tokenizermisc::comparewithoutspace(), readaslinesgeneral(), misclib_testcode::tokenizertest::test06(), misclib_testcode::tokenizertest::test08(), misclib_testcode::tokenizertest::test09(), tokenize(), and trim_and_prune().

00031 { 
00032   seq.remove_if(spacerdelete<>()); 
00033 }

void tokenizer::reset (  ) 

Resets token iteration.

void tokenizer::stripcomment ( stringc & comment  ) 

For each line, delete the comment marker and everything to its right, if the marker occurs in the line.

Definition at line 123 of file tokenizer.cpp.

References seq.

Referenced by misclib_testcode::tokenizertest::test04(), misclib_testcode::tokenizertest::test06(), and misclib_testcode::tokenizertest::test08().

00124 {
00125   liststringi k = seq.begin();
00126   string::size_type i;
00127   for (;k!=seq.end(); ++k)
00128   {
00129     string & token(*k);
00130     i=0;
00131     i = token.find(comment,i);
00132     if (i==string::npos)
00133       continue;
00134 
00135     token.erase(i);
00136   }
00137 }

void tokenizer::subtract ( stringc & atom  ) 

Splits strings removing the atom.

void tokenizer::tokenize (  ) 


delimiter set.

Definition at line 473 of file tokenizer.cpp.

References remove_if(), subtract(), and trim().

Referenced by gobjglColor3f::serializeInverse(), and gobjglVertex3f::serializeInverse().

00474 {
00475   subtract(" ");
00476   trim();
00477   remove_if();
00478 }

void tokenizer::trim (  ) 

Trim the leading and trailing space of each token.

void tokenizer::trim_and_prune (  ) 

Remove surrounding white space and delete empty strings.

Definition at line 40 of file tokenizer.cpp.

References remove_if(), and trim().

Referenced by projunittest::eval(), misclib_testcode::tokenizertest::test09(), and misclib_testcode::tokenizertest::unittest01().

00041 { 
00042   trim(); 
00043   remove_if(spacerdelete<>()); 
00044 }


Member Data Documentation

list<string> tokenizer::seq

Data representation.


The documentation for this class was generated from the following files:

tokenizer.h
tokenizer.cpp

Generated on Fri Mar 4 00:50:20 2011 for Chelton Evans Source by  doxygen 1.5.8