backends

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

backends

Aaron Griffin
Split.

On 10/26/06, Roman Kyrylych <[hidden email]> wrote:

> 2006/10/26, Aaron Griffin <[hidden email]>:
> > On 10/26/06, Roman Kyrylych <[hidden email]> wrote:
> > > [offtopic on]
> > > BTW, about complex patches - I have a patch from Martin Devera that
> > > adds support for gdbm database backend to pacman 2.9.8. I suggested
> > > him to contact VMiklos for help in inclusion it in pacman, but I don't
> > > know if he did that. If I adapt it for Pacman 3 API is there a chance
> > > it will be included at least in CVS/DARCS?
> > > [/offtopic off]
> >
> > There's no chance that would get included in pacman 2.9.8 - it will die soon.
> >
> > Converting it to pacman3 is entirely doable and it's already setup to
> > do so.  You should be able to copy be_files.c to be_gdbm.c and consume
> > the interface provided there.
> >
> > FTR 'be_' means 'backend'.  I still need to add compile-time flags for
> > selecting a backend, but as we only provide _files at the moment,
> > there's no need.
> >
> > I worked on a 'compressed' backend that required no changes, but read
> > directly from the downloaded db files... it caused problems on the
> > local db, however, so I never completed it (yet).
> >
> > As for backend schemes, I think we need to discuss avenues of attack
> > here.  The files backend is painfully slow on ext filesystems (11
> > seconds to parse [community]).  There are many users touting a
> > database backend as a good idea (I highly disagree), and if we do
> > this, we NEED a generic layer to support any sane database format, not
> > just one specific thing.  Using a flat-file database is a better
> > option, but here people seem to think sqlite is a good idea (it's not)
> > - there are many flat-file schemes that can drastically outperform
> > sqlite.
> >
> > There are many many other options, and I think it's worth at least
> > _supplying_ multiple backends, even if they're not used.  So gdbm is
> > cool.
>
> Maybe we should start separate thread for this?
> Personally, I think sqlite will be the best choice because of:
> 1) it's small
> 2) it's fast
> 3) it's embedable

Ok, sqlite is "fast and etc etc", right? So if I do [pacman -Ss foo],
what does sqlite do?  It opens the file (exactly as plaintext
searching would), initializes some structures and things like that
describing table data, then proceeds to sequentially search (FULL
TABLE SCAN) through all rows.  It is impossible, even in a good
database, to index substrings.  Yes, you can index entire strings,
that's easy.  The search operations check internal substrings, not
whole strings.
You could make this faster by converting each entry into a suffix tree
and storing that somewhere, but that gets irritating and gets into
that crappy Computer Science stuf everyone hates.

Fact of the matter is, there's two main "slow" operations - searching,
and loading the whole DB (in the case of an -Su).  Loading the whole
database can easilly be sped up by using lazy  package evaluation -
that is, a simple readdir() or the db directory gives us package name,
version, and release - all we need for a upgrade check.  The
additional data can be read when/if required.  Substring searching
will never be uber fast, but we can maintain a minor indexed file with
all text that's searched... something like:

extra/package-name-1.0-1 : This is the description for the package
community/package-two-1.1.1-1 : This is the description for this thing
extra/another-99-1 : This is the description for some stuff

-Ss can check this file, and use the first entry before the colon to
construct the path to open (though that may not be needed as all the
information is right there).

Ok, I'm gonna stop myself before I rant too much.  There are cries of
"use a database" from many people.  Frankly, I don't think that these
people have evaluated all the options and are stuck in the "everyone
uses databases for their websites" mentality.

Take this for instance: The google engineers here in Chicago explained
to me something they use to store whatever the hell google stores
there: A custom filesystem tooled to their spec.

They don't use mysql or oracle or sqlite any of that - the king of
searching uses the simplest of all options.

> 4) it offers full power of SQL99 (with ACID transactions!!!)

Nope. Not at all.  It doesn't even support SQL92 - a 14 year old standard.
http://www.sqlite.org/omitted.html
http://www.sqlite.org/cvstrac/wiki?p=UnsupportedSql
They have said publicly they do not plan on adding support.  Standards
compliance is HUGE in my book.

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Roman Kyrylych
2006/10/26, Aaron Griffin <[hidden email]>:

> > Personally, I think sqlite will be the best choice because of:
> > 1) it's small
> > 2) it's fast
> > 3) it's embedable
>
> Ok, sqlite is "fast and etc etc", right? So if I do [pacman -Ss foo],
> what does sqlite do?  It opens the file (exactly as plaintext
> searching would), initializes some structures and things like that
> describing table data, then proceeds to sequentially search (FULL
> TABLE SCAN) through all rows.  It is impossible, even in a good
> database, to index substrings.  Yes, you can index entire strings,
> that's easy.  The search operations check internal substrings, not
> whole strings.
> . . . . .
> . . . . .
I don't get what you whant to say here. Using SQLite will not allow
full text search to be faster or slower, but easier, because all
needed functions are already there.
You don't need to do complex indexing. Indexing by package name is enough.
By terms "small, fast, embedable" I mean that it's better than MySQL
or something else. And it's better than dbm-style databases when it
comes to random writing and transactions.
And yes, it is faster than filesystem backend, because there is a
_simple_ way to use _in-memory_ databases! ;-)

> > 4) it offers full power of SQL99 (with ACID transactions!!!)
>
> Nope. Not at all.  It doesn't even support SQL92 - a 14 year old standard.
Oops, my bad, I confused SQL99 with SQL92. :-p
Anyway, SQLite offers many functions almost for free.
And anyway there should be a choice of files/gdbm/sqlite/whatever.

--
Roman Kyrylych (Роман Кирилич)
_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Aaron Griffin
On 10/26/06, Roman Kyrylych <[hidden email]> wrote:
> You don't need to do complex indexing. Indexing by package name is enough.
> By terms "small, fast, embedable" I mean that it's better than MySQL
> or something else. And it's better than dbm-style databases when it
> comes to random writing and transactions.

What I'm saying is that you have the overhead of a full-scale database
with little gain.  Indexing by package names, yes that's great and
all, but that doesn't help with the slowest-of-the-slow -Ss operation.
 -Ss searches package names AND descriptions, allowing for regex
matching.  Sqlite (and most DBs) do not support regex matching.
Indexing the search by package name means little, because a -Ss "foo"
still SHOULD match a package named "barfoo" and a package with "foo"
in the description.  This means a sequential search.  Not only that,
but because the DB will not support regex, that means that one must
iterate over each and every entry, get the values at the C level,
apply a regex pattern, and note if it is a match or not.  The only
speed you gain would be in the initial opening of the files.  To me,
this does not mean a DB backend is the solution.  It may be better
than the files backend, yes, but not the best, and outperforming the
files backend is not hard... I was able to improve performance
approximately 6 times by simply using the db.tar.gz files in place of
disparate text files.

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Roman Kyrylych
2006/10/26, Aaron Griffin <[hidden email]>:

> What I'm saying is that you have the overhead of a full-scale database
> with little gain.  Indexing by package names, yes that's great and
> all, but that doesn't help with the slowest-of-the-slow -Ss operation.
>  -Ss searches package names AND descriptions, allowing for regex
> matching.  Sqlite (and most DBs) do not support regex matching.
> Indexing the search by package name means little, because a -Ss "foo"
> still SHOULD match a package named "barfoo" and a package with "foo"
> in the description.  This means a sequential search.  Not only that,
> but because the DB will not support regex, that means that one must
> iterate over each and every entry, get the values at the C level,
> apply a regex pattern, and note if it is a match or not.  The only
> speed you gain would be in the initial opening of the files.  To me,
> this does not mean a DB backend is the solution.  It may be better
> than the files backend, yes, but not the best, and outperforming the
> files backend is not hard... I was able to improve performance
> approximately 6 times by simply using the db.tar.gz files in place of
> disparate text files.

Yes, the main problem with files backend is that huge amount of files
in /var/lib/pacman that leads to slow performance on many filesystems.
gdbm/sqlite/whatever has a big "+" that everithing can be in one file.
Plus, as I said, with SQLite you have the ability to easily use
in-memory database + you have easy ACID transactions support. This
will be faster than seeking through the /var/lib/pacman/ anyway. And
there will be no need for pacman-optimize-like scripts for database
backends.
The overhead is not big. (A quote from sqlite.org: "Small code
footprint: less than 250KiB fully configured or less than 150KiB with
optional features omitted")

Anyway nobody imposes use of some database backend as default and the
only one. But having alternative is good. Users can test performance
by themself and choose the solution that fits their needs in the best
way. Otherwise, what's the point of the ALPM's ability to have
different backends? ;-)
About regexps - yes, SQLite doesn't support them, but adding this on
top of it will be not slower than in files backend anyway. I don't see
a problem here.

I'm not "advertising" SQLite or other databases because "everybody use
databases for everything". I just think that this will offer some new
abilities for libalpm and will simplify (and speedup, yes - speedup)
some things. That's IMHO, of course, but I think it worth a try gdbm
and sqlite as alternatives to files backend.

--
Roman Kyrylych (Роман Кирилич)
_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Jürgen Hötzel
On Fri, Oct 27, 2006 at 12:03:40AM +0300, Roman Kyrylych wrote:

> 2006/10/26, Aaron Griffin <[hidden email]>:
> > What I'm saying is that you have the overhead of a full-scale database
> > with little gain.  Indexing by package names, yes that's great and
> > all, but that doesn't help with the slowest-of-the-slow -Ss operation.
> >  -Ss searches package names AND descriptions, allowing for regex
> > matching.  Sqlite (and most DBs) do not support regex matching.
> > Indexing the search by package name means little, because a -Ss "foo"
> > still SHOULD match a package named "barfoo" and a package with "foo"
> > in the description.  This means a sequential search.  Not only that,
> > but because the DB will not support regex, that means that one must
> > iterate over each and every entry, get the values at the C level,
> > apply a regex pattern, and note if it is a match or not.  The only
> > speed you gain would be in the initial opening of the files.  To me,
> > this does not mean a DB backend is the solution.  It may be better
> > than the files backend, yes, but not the best, and outperforming the
> > files backend is not hard... I was able to improve performance
> > approximately 6 times by simply using the db.tar.gz files in place of
> > disparate text files.
>
> Yes, the main problem with files backend is that huge amount of files
> in /var/lib/pacman that leads to slow performance on many filesystems.
> gdbm/sqlite/whatever has a big "+" that everithing can be in one file.
This is a big "-", because unix lovers prefer multiple text files. This
makes it easy to use powerful text/file-processing tools like "sed",
"awk", "shell" and "find" to build package-management related scripts.

> Plus, as I said, with SQLite you have the ability to easily use
> in-memory database + you have easy ACID transactions support. This
> will be faster than seeking through the /var/lib/pacman/ anyway. And
> there will be no need for pacman-optimize-like scripts for database
> backends.
> The overhead is not big. (A quote from sqlite.org: "Small code
> footprint: less than 250KiB fully configured or less than 150KiB with
> optional features omitted")

You will gain little performance (some seconds on a slow system?) at the
expense of complexity.

Jürgen

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev

attachment0 (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: backends

Roman Kyrylych
2006/10/27, Jürgen Hötzel <[hidden email]>:
> > Yes, the main problem with files backend is that huge amount of files
> > in /var/lib/pacman that leads to slow performance on many filesystems.
> > gdbm/sqlite/whatever has a big "+" that everithing can be in one file.
>
> This is a big "-", because unix lovers prefer multiple text files. This
> makes it easy to use powerful text/file-processing tools like "sed",
> "awk", "shell" and "find" to build package-management related scripts.

There're nothing that don't allow to create some "API" of shell
functions for doing most common operations that can be used by other
shell scripts. And this way will be more backward compatible in case
there will be some changes to storage format.

> > Plus, as I said, with SQLite you have the ability to easily use
> > in-memory database + you have easy ACID transactions support. This
> > will be faster than seeking through the /var/lib/pacman/ anyway. And
> > there will be no need for pacman-optimize-like scripts for database
> > backends.
> > The overhead is not big. (A quote from sqlite.org: "Small code
> > footprint: less than 250KiB fully configured or less than 150KiB with
> > optional features omitted")
>
> You will gain little performance (some seconds on a slow system?) at the
> expense of complexity.

Why not make a benchmark? ;-) Really, there are some users that
complain about pacman slowness, and it's not about really slow
operations like -Ss, but about really trivial operations, that
shouldn't be so slow. That's because files backend is highly dependent
on the type of underlying filesystem. Database backend may make these
users happy,because it's not dependent on filesystem type.

Personally, I don't think there's a problem _at_all_! Anyway files
backend will be default in pacman3, and anyway I doubt it will be
replaced soon (if ever). Why not just have an _alternative_?

Keep in mind that ALPM can and will be used by other distros. Except
Arch Linux and Frugalware there're Underground Linux (mostly Arch
Linux Install CD + KDE and other desktop software) and Rubix Linux
(formally, simply Rubix, anyway it's dead now), but I'm sure there
will be some more in future (though I don't think more than 300
distros is good, even DistroWatch author thinks the same). Other
distros may choose another backend. Of course, they can implement them
by themselves, but anyway there was a reason why multiple backend
fearure was included into ALPM. So why not to use it? Especially when
it does not impose anything. Having alternatives is good, IMHO.

--
Roman Kyrylych (Роман Кирилич)
_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Aaron Griffin
On 10/26/06, Roman Kyrylych <[hidden email]> wrote:
> Why not make a benchmark? ;-) Really, there are some users that
> complain about pacman slowness, and it's not about really slow
> operations like -Ss, but about really trivial operations, that
> shouldn't be so slow. That's because files backend is highly dependent
> on the type of underlying filesystem. Database backend may make these
> users happy,because it's not dependent on filesystem type.

No no no.... you're missing the point.  Because reading individual
files is slow, we need a new backend.  This does not imply a database
backend is a good idea.  It implies a single-file solution is ideal,
yes.  What I've been trying to show was that, considering the
slow-down is from reading hundreds of files, that does not imply a
database.  ANY single-file solution would work.

That is why I was on about string matching and things of that nature -
because if we take "single file solution A" and compare it to "DB
solution A", there are negligable differences, and the DB one adds
much more complexity and security flaws (i.e. embedded SQL statement
strings).

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Roman Kyrylych
2006/10/27, Aaron Griffin <[hidden email]>:
> No no no.... you're missing the point.  Because reading individual
> files is slow, we need a new backend.  This does not imply a database
> backend is a good idea.  It implies a single-file solution is ideal,
> yes.  What I've been trying to show was that, considering the
> slow-down is from reading hundreds of files, that does not imply a
> database.  ANY single-file solution would work.

Yes, I know this. You just get words out of my mouth! :)
But single file will need some structure anyway. So, either we'll have
our own format, or use something like gdbm or even simpler, or
something more complex like sqlite.

> That is why I was on about string matching and things of that nature -
> because if we take "single file solution A" and compare it to "DB
> solution A", there are negligable differences, and the DB one adds
> much more complexity and security flaws (i.e. embedded SQL statement
> strings).

OK, so, what do you propose? Use a simple tar.gz and process it in
memory, or what?

Still, I think sqlite or gdbm can be tried, but that's will be not in
the near future anyway.
There're more important things now anyway. :-)

--
Roman Kyrylych (Роман Кирилич)
_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Maciej Borzecki
On Fri, 27 Oct 2006 01:22:03 +0300
"Roman Kyrylych" <[hidden email]> wrote:

>
> Still, I think sqlite or gdbm can be tried, but that's will be not in
> the near future anyway.
> There're more important things now anyway. :-)
>
As for gdbm backend, I have been going through adding it to pacman3.
However I have been working on some quite old pacman3 cvs sources
(seems like updating will be quite painful given the recent
discussions on the list). Depending on the amount if free time I will
have, I may provide the sources in 2 weeks or so, but I would consider
this as a proof of concept rathern than release (or anything
close to it) version.

Maciek

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Judd Vinet
In reply to this post by Aaron Griffin
On Thu, 26 Oct 2006 17:11:05 -0500
"Aaron Griffin" <[hidden email]> wrote:
> No no no.... you're missing the point.  Because reading individual
> files is slow, we need a new backend.  This does not imply a database
> backend is a good idea.  It implies a single-file solution is ideal,
> yes.  What I've been trying to show was that, considering the
> slow-down is from reading hundreds of files, that does not imply a
> database.  ANY single-file solution would work.

Heh, well almost any.  See pacman1 for an implementation that did not
work.   Well, it worked, but it was painfully slow -- sequential
searching through a massive text file to find the package in question.


- J

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev
Reply | Threaded
Open this post in threaded view
|

Re: backends

Aaron Griffin
On 10/27/06, Judd Vinet <[hidden email]> wrote:

> On Thu, 26 Oct 2006 17:11:05 -0500
> "Aaron Griffin" <[hidden email]> wrote:
> > No no no.... you're missing the point.  Because reading individual
> > files is slow, we need a new backend.  This does not imply a database
> > backend is a good idea.  It implies a single-file solution is ideal,
> > yes.  What I've been trying to show was that, considering the
> > slow-down is from reading hundreds of files, that does not imply a
> > database.  ANY single-file solution would work.
>
> Heh, well almost any.  See pacman1 for an implementation that did not
> work.   Well, it worked, but it was painfully slow -- sequential
> searching through a massive text file to find the package in question.

Ok, "any SANE single-file solution" 8)

Though I have seen some decent flat-file systems in the past (VMS's
RMS files were fairly fast if you only used one file, though "joining"
was terriblly slow)

_______________________________________________
pacman-dev mailing list
[hidden email]
http://www.archlinux.org/mailman/listinfo/pacman-dev