Searching in tree structures – PHP

  php

Q(Question):

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

I developed a fulltext search script for the CMS that will return
pages ordered by relevance.

This is fine, but I’ve now had a change request whereby searches can
be restricted to specific sites. This is a problem because the only
way to know what site an item belongs to is to fetch it, walk up it’s
tree and look at it’s root node.

I can see 2 possible solutions, but both have drawbacks. The first is
to just fetch all matches and then discard ones that don’t belong to
the specified site. This means increased DB load and traffic between
the DB and the webserver and could potentially cause quite a
performance hit when the number of pages being dealt with climbs into
the thousands.

The other solution is to, in addition to the parent node, also store
the root node in the database. This would give an extra field to
match against and make eliminating items with an SQL query, thus
reducing load and traffic, quite simple. However the big drawback is
that there are now two fields determining where an item fits into the
tree structure instead of just one, making it an obvious source of
potential error. It would also mean updating code related to item
creation updating that wouldn’t have to be touched with the other
approach.

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

A(Answer):

Gordon wrote:

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

I developed a fulltext search script for the CMS that will return
pages ordered by relevance.

This is fine, but I’ve now had a change request whereby searches can
be restricted to specific sites. This is a problem because the only
way to know what site an item belongs to is to fetch it, walk up it’s
tree and look at it’s root node.

I can see 2 possible solutions, but both have drawbacks. The first is
to just fetch all matches and then discard ones that don’t belong to
the specified site. This means increased DB load and traffic between
the DB and the webserver and could potentially cause quite a
performance hit when the number of pages being dealt with climbs into
the thousands.

The other solution is to, in addition to the parent node, also store
the root node in the database. This would give an extra field to
match against and make eliminating items with an SQL query, thus
reducing load and traffic, quite simple. However the big drawback is
that there are now two fields determining where an item fits into the
tree structure instead of just one, making it an obvious source of
potential error. It would also mean updating code related to item
creation updating that wouldn’t have to be touched with the other
approach.

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

Not from the PHP end. You will have to retrieve the data and throw away
what you don’t want.

If you’re looking for a database solution, you should try the
appropriate database newsgroup.


==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

A(Answer):

Gordon wrote:

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

I developed a fulltext search script for the CMS that will return
pages ordered by relevance.

This is fine, but I’ve now had a change request whereby searches can
be restricted to specific sites. This is a problem because the only
way to know what site an item belongs to is to fetch it, walk up it’s
tree and look at it’s root node.

I can see 2 possible solutions, but both have drawbacks. The first is
to just fetch all matches and then discard ones that don’t belong to
the specified site. This means increased DB load and traffic between
the DB and the webserver and could potentially cause quite a
performance hit when the number of pages being dealt with climbs into
the thousands.

The other solution is to, in addition to the parent node, also store
the root node in the database. This would give an extra field to
match against and make eliminating items with an SQL query, thus
reducing load and traffic, quite simple. However the big drawback is
that there are now two fields determining where an item fits into the
tree structure instead of just one, making it an obvious source of
potential error. It would also mean updating code related to item
creation updating that wouldn’t have to be touched with the other
approach.

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

I opt for the second. It falls into the realm of code once, thoroughly
test, forget about it. A severe performance hit, OTOH, is always present.

A(Answer):

Gordon escribió:

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

[…]

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

There are many DBMS-specific solutions to handle hierarchical data. What
database types does your CMS need to support?

In any case, unless your tree is too big you can:

1. Fetch the complete tree structure (1 DB query and not much data, just
IDs)
2. Find the IDs of the right nodes in PHP
3. Fetch the desired info filtering by node: WHERE NODE_ID IN (….)

A couple of links I’ve gathered:

[MySQL+InnoDB] http://jan.kneschke.de/projects/mysql/sp
[Oracle] http://philip.greenspun.com/sql/trees.html

http://alvaro.es – Álvaro G. Vicario – Burgos, Spain
— Mi sitio sobre programación web: http://bits.demogracia.com
— Mi web de humor al baño María: http://www.demogracia.com

A(Answer):

On May 20, 1:00 pm, sheldonlg <sheldonlgwrote:

Gordon wrote:

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

I developed a fulltext search script for the CMS that will return
pages ordered by relevance.

This is fine, but I’ve now had a change request whereby searches can
be restricted to specific sites. This is a problem because the only
way to know what site an item belongs to is to fetch it, walk up it’s
tree and look at it’s root node.

I can see 2 possible solutions, but both have drawbacks. The first is
to just fetch all matches and then discard ones that don’t belong to
the specified site. This means increased DB load and traffic between
the DB and the webserver and could potentially cause quite a
performance hit when the number of pages being dealt with climbs into
the thousands.

The other solution is to, in addition to the parent node, also store
the root node in the database. This would give an extra field to
match against and make eliminating items with an SQL query, thus
reducing load and traffic, quite simple. However the big drawback is
that there are now two fields determining where an item fits into the
tree structure instead of just one, making it an obvious source of
potential error. It would also mean updating code related to item
creation updating that wouldn’t have to be touched with the other
approach.

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

I opt for the second. It falls into the realm of code once, thoroughly
test, forget about it. A severe performance hit, OTOH, is always present.

That’s a good point. I suppose if I’m sufficiantly careful then the 2
pointers solution offers almost no performance hit as opposed to the
other approach. I just get a bit paranoid about that kind of thing, I
had it drilled into me during my education on databases you don’t
store redundant data because of the problems it can cause if it
desyncs. In this case the tradeoff is probably worth it.

A(Answer):

On May 20, 1:06 pm, "Álvaro G. Vicario"
<alvaroNOSPAMTHA…@demogracia.comwrote:

Gordon escribió:

I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).

[…]

If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

There are many DBMS-specific solutions to handle hierarchical data. What
database types does your CMS need to support?

In any case, unless your tree is too big you can:

1. Fetch the complete tree structure (1 DB query and not much data, just
IDs)
2. Find the IDs of the right nodes in PHP
3. Fetch the desired info filtering by node: WHERE NODE_ID IN (….)

A couple of links I’ve gathered:

[MySQL+InnoDB] http://jan.kneschke.de/projects/mysql/sp
[Oracle]http://philip.greenspun.com/sql/trees.html


–http://alvaro.es- Álvaro G. Vicario – Burgos, Spain
— Mi sitio sobre programación web:http://bits.demogracia.com
— Mi web de humor al baño María:http://www.demogracia.com

Thanks for the pointers. The CMS I’m working on is using Postgres
8.3.1 as the backend. Very good fulltext search once you’ve gotten
used to its quirks. 🙂

A(Answer):

Gordon wrote:

On May 20, 1:00 pm, sheldonlg <sheldonlgwrote:

>Gordon wrote:

>>I’ve developed a CMS that manages content across several sites. The
content is represented in a tree structure maintained in a database,
where each item has an id and a parent that points to the containing
item or 0 if it’s a root item (a site).
I developed a fulltext search script for the CMS that will return
pages ordered by relevance.
This is fine, but I’ve now had a change request whereby searches can
be restricted to specific sites. This is a problem because the only
way to know what site an item belongs to is to fetch it, walk up it’s
tree and look at it’s root node.
I can see 2 possible solutions, but both have drawbacks. The first is
to just fetch all matches and then discard ones that don’t belong to
the specified site. This means increased DB load and traffic between
the DB and the webserver and could potentially cause quite a
performance hit when the number of pages being dealt with climbs into
the thousands.
The other solution is to, in addition to the parent node, also store
the root node in the database. This would give an extra field to
match against and make eliminating items with an SQL query, thus
reducing load and traffic, quite simple. However the big drawback is
that there are now two fields determining where an item fits into the
tree structure instead of just one, making it an obvious source of
potential error. It would also mean updating code related to item
creation updating that wouldn’t have to be touched with the other
approach.
If any of you guys out there have another potential solution that I
might have not thought of I’d like to hear what they are. Failing
that, which of the options I do have would you recommend?

I opt for the second. It falls into the realm of code once, thoroughly
test, forget about it. A severe performance hit, OTOH, is always present.

That’s a good point. I suppose if I’m sufficiantly careful then the 2
pointers solution offers almost no performance hit as opposed to the
other approach. I just get a bit paranoid about that kind of thing, I
had it drilled into me during my education on databases you don’t
store redundant data because of the problems it can cause if it
desyncs. In this case the tradeoff is probably worth it.

Depending on the database, it might be possible with the structure you
have now. But you need to be asking in a newsgroup for your database.


==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

LEAVE A COMMENT